# Posterior derivation for coin-flipping

## I. Posterior derivation for discrete uninformed prior

Let us fix a uniform grid of potential bias values $p\in\{0.0, 0.1,\ldots, 1.0\}$. Then the uninformed prior assigns the same probability to each of the eleven grid points:

\begin{align}
\Pr[p=0.0]&=\frac{1}{11}\\
\Pr[p=0.1]&=\frac{1}{11}\\
&\cdots
\enspace.
\end{align}

The likelihood of seeing $k$ ones out of $n$ throws is given by the binomial formula

\begin{align}
\Pr[k|n,p]=\binom{n}{k}p^k(1-p)^{n-k}
\end{align}

and thus the Bayes formula gives

\begin{align}
\Pr[p|n,k]=\frac{\Pr[k|n,p]\cdot \Pr[p]}{\Pr[k|n]}
\end{align}

where the a priori probability $\Pr[k|n]$ of seeing $k$ ones out of $n$ throws

\begin{align}
\Pr[k|n]=\sum_{p}\Pr[k,p|n]=\sum_p \Pr[k|p,n]\Pr[p]
\end{align}

does not depend on the bias value $p$. Therefore, as a function of $p$, the probability $\Pr[p|n,k]$ is determined up to a constant factor by $\Pr[k|n,p]\cdot \Pr[p]$. The binomial coefficient and the uniform prior are also constant in $p$ and can be dropped as well. We use the following shorthand to emphasise this fact

\begin{align*}
\Pr[p|n,k]\propto\Pr[k|n,p]\cdot \Pr[p]\propto p^k(1-p)^{n-k}\enspace.
\end{align*}
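The derivation above can be checked numerically. The following is a minimal sketch in plain Python; the function name `grid_posterior` and the data values $k=3$, $n=10$ are illustrative assumptions and not part of the original derivation.

```python
from math import comb

def grid_posterior(k, n, grid, prior=None):
    """Posterior Pr[p | n, k] over a finite grid of bias values."""
    if prior is None:
        # Uninformed prior: the same probability for every grid point.
        prior = [1.0 / len(grid)] * len(grid)
    # Unnormalised posterior: binomial likelihood times prior.
    weights = [comb(n, k) * p**k * (1 - p)**(n - k) * w
               for p, w in zip(grid, prior)]
    evidence = sum(weights)  # Pr[k | n], does not depend on p
    return [w / evidence for w in weights]

grid = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
posterior = grid_posterior(k=3, n=10, grid=grid)  # hypothetical data
for p, q in zip(grid, posterior):
    print(f"p = {p:.1f}  posterior = {q:.4f}")
```

Dividing by `evidence` is the only place where the constant factors matter, which is why the shorthand with $\propto$ loses no information.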



## II. Posterior derivation for informed prior

Let us assume that we know that $p\in[0.1,0.2]$ for the same uniform grid. Then it is straightforward to see that 

\begin{align}
\Pr[p]\propto
\begin{cases}
1, &\text{if } p\in[0.1,0.2]\\
0, &\text{otherwise}
\end{cases}
\end{align}

and we get

\begin{align*}
\Pr[p|n,k]\propto 
\begin{cases}
p^k(1-p)^{n-k}, &\text{if } p\in[0.1,0.2]\\
0, &\text{otherwise}\enspace.
\end{cases}
\end{align*}

That is, the formula is the same but restricted to the valid bias values. This does not mean that the posterior distributions for the uninformed and the informed case are the same. The unknown normalising coefficient is different for the two distributions and thus the concrete values are different.
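A small numerical sketch of this point, assuming hypothetical data $k=3$, $n=10$ and the helper name `normalised`: the two posteriors share the same kernel, but their normalising constants, and hence their values, differ.

```python
def normalised(weights):
    """Rescale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

k, n = 3, 10  # hypothetical data
grid = [i / 10 for i in range(11)]
kernel = [p**k * (1 - p)**(n - k) for p in grid]

# Informed prior: only the grid points 0.1 and 0.2 (indices 1 and 2) are allowed.
allowed = [1 <= i <= 2 for i in range(11)]

uninformed = normalised(kernel)
informed = normalised([w if ok else 0.0 for w, ok in zip(kernel, allowed)])

for p, u, v in zip(grid, uninformed, informed):
    print(f"p = {p:.1f}  uninformed = {u:.4f}  informed = {v:.4f}")
```

The ratio between the posterior values at $p=0.1$ and $p=0.2$ is the same in both columns; only the overall scale changes.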

## III. Hidden assumptions behind uninformed prior

Defining an uninformed prior is not as straightforward as it seems: the choice of the grid points also encodes our preferences even if we assign the same prior probability to each model.

As an example, consider an uneven grid $p\in\{0.0, 0.1, 0.101, 0.102, \ldots, 0.200, 0.3,\ldots, 1.0\}$ where each grid point gets the same prior probability $c$. Then it is straightforward to see that

\begin{align}
\Pr[p\in[0.1,0.2]]&=c\cdot 101\\
\Pr[p\notin[0.1,0.2]]&=c\cdot 9 
\end{align}

and thus we have implicitly declared that the interval $[0.1,0.2]$ is much more probable than the rest of the parameter range.
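The prior mass of the two regions can be obtained by simply counting grid points. The sketch below stores the grid in thousandths so that all comparisons are exact; the variable names are illustrative.

```python
from fractions import Fraction

# Uneven grid in thousandths: 0.0, then 0.100, 0.101, ..., 0.200, then 0.3, ..., 1.0.
uneven_grid = [0] + list(range(100, 201)) + list(range(300, 1001, 100))

inside = sum(1 for p in uneven_grid if 100 <= p <= 200)
outside = len(uneven_grid) - inside

# Equal prior weight c for every grid point.
c = Fraction(1, len(uneven_grid))
print("points inside [0.1, 0.2]: ", inside, " prior mass:", inside * c)
print("points outside [0.1, 0.2]:", outside, " prior mass:", outside * c)
```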


This leads to a philosophical problem. What makes the uniform grid different from the non-uniform grid? There is no easy answer to this question. The problem becomes even harder if the model has many alternative parametrisations. A grid that is uniform in one parametrisation does not have to be uniform in another, and we need to argue that a particular parametrisation is more natural than alternatives such as

\begin{align}
\Pr[X=1]=p^2\enspace.
\end{align}

For a coin-flipping problem the natural bias parameter $p$ is the probability of heads, but for other problems the question of the most natural parametrisation is non-trivial. Information geometry gives a partial answer to it.
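To illustrate how the parametrisation matters, the sketch below compares two eleven-point grids for the parametrisation $\Pr[X=1]=p^2$: one uniform in $p$ and one uniform in the probability of heads $q=p^2$. The threshold $q\le 1/4$ and the variable names are arbitrary choices made for the illustration.

```python
from fractions import Fraction

grid_size = 11

# Grid uniform in p = i/10. The induced probability of heads is q = (i/10)^2,
# so q <= 1/4 is equivalent to i*i <= 25 (integer arithmetic keeps this exact).
uniform_in_p = sum(1 for i in range(grid_size) if i * i <= 25)

# Grid uniform in q = j/10, so q <= 1/4 is equivalent to 4*j <= 10.
uniform_in_q = sum(1 for j in range(grid_size) if 4 * j <= 10)

print("prior mass of q <= 1/4, grid uniform in p:", Fraction(uniform_in_p, grid_size))
print("prior mass of q <= 1/4, grid uniform in q:", Fraction(uniform_in_q, grid_size))
```

Both grids assign equal weight to every grid point, yet they disagree on how probable a low chance of heads is.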

## IV. Posterior derivation for uninformed continuous prior

There are several ways to derive the posterior. One of them is to consider the uniform grid with spacing $\Delta p\to 0$. Then $\Pr[p|k,n]\propto p^k(1-p)^{n-k}$ as before but the normalising constant changes to

\begin{align}
c(\Delta p)=\sum_{p\in\{0,\Delta p,\ldots,1\}} p^k(1-p)^{n-k}/ \left(\frac{1}{\Delta p}+1\right)\approx \sum_{p\in\{0,\Delta p,\ldots,1\}} p^k(1-p)^{n-k}\,\Delta p \approx
\int_0^1 p^k(1-p)^{n-k}dp\enspace,
\end{align}

where the approximation becomes more and more precise as $\Delta p\to 0$.

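The integral can be evaluated analytically, as it is the beta function $B(k+1,n-k+1)=\Gamma(k+1)\Gamma(n-k+1)/\Gamma(n+2)$. Dividing by it gives the posterior density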

\begin{align*}  
p[p|k,n] = \frac{\Gamma(n+2)}{\Gamma(k+1)\Gamma(n-k+1)}\cdot p^k(1-p)^{n-k}\enspace.
\end{align*}
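The limit argument can be checked numerically. The sketch below compares the grid-based constant with the closed form for hypothetical data $k=3$, $n=10$; the function names are assumptions made for the illustration.

```python
from math import gamma

def grid_normaliser(k, n, steps):
    """Average of p^k (1-p)^(n-k) over the uniform grid with spacing 1/steps."""
    total = sum((i / steps)**k * (1 - i / steps)**(n - k)
                for i in range(steps + 1))
    return total / (steps + 1)  # steps + 1 equals 1/Delta p + 1 grid points

def exact_normaliser(k, n):
    """Integral of p^k (1-p)^(n-k) over [0, 1] expressed with gamma functions."""
    return gamma(k + 1) * gamma(n - k + 1) / gamma(n + 2)

k, n = 3, 10  # hypothetical data
exact = exact_normaliser(k, n)
for steps in (10, 100, 1000, 10000):
    approx = grid_normaliser(k, n, steps)
    print(f"steps = {steps:5d}  relative error = {abs(approx - exact) / exact:.2e}")
```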




## V. Maximum posterior estimate

If you know the posterior and can make only one guess for the bias parameter, then the obvious choice is the bias value with the highest posterior probability. Again, we can ignore constants and maximise

\begin{align}
F(p) = p^k(1-p)^{n-k}\enspace.
\end{align}


The standard technique for finding a maximum is to compute the derivative and set it to zero:

\begin{align}
\frac{\partial F}{\partial p}= \frac{\partial p^k}{\partial p} (1-p)^{n-k} + \frac{\partial (1-p)^{n-k}}{\partial p}  p^k=0\enspace.
\end{align}

Solving this equation is doable but very technical. Hence, we consider the logarithm of the posterior instead.
As the logarithm is a monotone function, this does not change the locations of maxima and minima. By taking a logarithm we stretch and squeeze the $y$-axis, which deforms the function but preserves the locations of peaks and valleys.

\begin{align}
\log F &=\log(p^k(1-p)^{n-k}) \\
&= k \log(p) + (n-k)\log(1-p)\\
\frac{\partial \log F}{\partial p}&= 
k \cdot \frac{1}{p} + (n-k)\cdot \frac{1}{1-p}\cdot(-1) \\
&= \frac{(1-p)k -(n-k)p}{p(1-p)}=\frac{k - np}{p(1-p)} 
\end{align}

The derivative is zero exactly when $k-np=0$, from which we get

\begin{align}
p=\frac{k}{n}
\end{align}

This is the well-known classical estimator for the probability.
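As a sanity check, a brute-force search over a fine grid should find the same maximiser for $F$ and for $\log F$, and both should agree with $k/n$. The sketch below assumes hypothetical data $k=3$, $n=10$ and only searches interior points to avoid $\log 0$.

```python
from math import log

def kernel(p, k, n):
    """Unnormalised posterior F(p) = p^k (1-p)^(n-k)."""
    return p**k * (1 - p)**(n - k)

def log_kernel(p, k, n):
    """Logarithm of F(p), valid for 0 < p < 1."""
    return k * log(p) + (n - k) * log(1 - p)

k, n = 3, 10  # hypothetical data
candidates = [i / 1000 for i in range(1, 1000)]  # interior grid points only

p_max_f = max(candidates, key=lambda p: kernel(p, k, n))
p_max_log = max(candidates, key=lambda p: log_kernel(p, k, n))

print("argmax of F:    ", p_max_f)
print("argmax of log F:", p_max_log)
print("k / n:          ", k / n)
```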

## VI. Beta distribution as uninformed continuous posterior

A distribution with the density function

\begin{align*}  
p[p|k,n] = \frac{\Gamma(n+2)}{\Gamma(k+1)\Gamma(n-k+1)}\cdot p^k(1-p)^{n-k}\enspace
\end{align*}

is known as the **beta distribution**. The beta distribution is classically parametrised with the parameters

\begin{align*}
\alpha&=k+1\\
\beta&=n-k+1
\end{align*}

and thus
\begin{align*}  
p[p|\alpha,\beta] = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\cdot p^{\alpha-1}(1-p)^{\beta-1}\enspace.
\end{align*}

As we are not interested in the mathematical beauty of the density functions, we still use $n$ and $k$ as an alternative parametrisation.

The beta distribution is the continuous posterior of an uninformed person about the bias of a coin after observing $k$ heads out of $n$ independent throws of that coin.
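The two parametrisations can be written down directly in code. The following sketch uses only the standard library; the function names `beta_pdf` and `coin_posterior_pdf` are illustrative, and the density is evaluated on the open interval $0<p<1$ only.

```python
from math import exp, lgamma, log

def beta_pdf(p, alpha, beta):
    """Beta density in the classical (alpha, beta) parametrisation, for 0 < p < 1."""
    log_norm = lgamma(alpha + beta) - lgamma(alpha) - lgamma(beta)
    return exp(log_norm + (alpha - 1) * log(p) + (beta - 1) * log(1 - p))

def coin_posterior_pdf(p, k, n):
    """The same density in the (k, n) parametrisation used in these notes."""
    return beta_pdf(p, alpha=k + 1, beta=n - k + 1)

# Midpoint-rule check that the density integrates to one (hypothetical k=3, n=10).
steps = 10000
area = sum(coin_posterior_pdf((i + 0.5) / steps, 3, 10) for i in range(steps)) / steps
print("area under the density:", area)
print("density at p = 0.3:    ", coin_posterior_pdf(0.3, 3, 10))
```

Working with `lgamma` instead of `gamma` is a design choice that avoids overflow for large $n$.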









## VII. Conjugate priors and Laplace smoothing

Let us now consider an experiment where we first observe $k_1$ heads out of $n_1$ throws and then observe $k_2$ heads out of $n_2$ throws in another experiment with the same coin. There are two ways to arrive at the posterior.

### Single posterior update

The two observations can jointly be considered as a single experiment where we get $k_1+k_2$ heads out of $n_1+n_2$ throws. If we use an uninformative prior, then we get a beta distribution

\begin{align}
p[p|k=k_1+k_2, n=n_1+n_2]\propto p^{k_1+k_2}(1-p)^{n_1+n_2-k_1-k_2}\enspace.
\end{align}


### Iterative posterior update

Alternatively, we can think of it as a two-step experiment where we first observe $k_1$ heads out of $n_1$ throws and thus get a posterior

\begin{align}
p[p|k=k_1, n=n_1]\propto p^{k_1}(1-p)^{n_1-k_1}\enspace.
\end{align}

To analyse the second step, we need to fix a prior for the coin bias. Given our knowledge about the previous experiment, this prior must be equal to the posterior of the first experiment. We have not received any additional information between the experiments and thus our uncertainty must remain the same. Multiplying the likelihood of the second experiment with this prior, we can conclude

\begin{align}
p[p|k=k_1+k_2, n=n_1+n_2]&\propto p^{k_2}(1-p)^{n_2-k_2}\cdot p^{k_1}(1-p)^{n_1-k_1}\\
&\propto p^{k_1+k_2}(1-p)^{n_1+n_2-k_1-k_2}\enspace.
\end{align}

The result coincides with the first derivation. It has to, as otherwise the theory would not be **internally consistent**. We should always get the same result regardless of the way we decompose our problem.
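The consistency claim can also be verified numerically on a grid. In the sketch below the function `update`, the grid resolution and the data values are assumptions chosen for illustration.

```python
def update(prior, grid, k, n):
    """One Bayes update on a grid: multiply by the likelihood kernel and renormalise."""
    weights = [w * p**k * (1 - p)**(n - k) for p, w in zip(grid, prior)]
    total = sum(weights)
    return [w / total for w in weights]

grid = [i / 100 for i in range(101)]
uniform = [1 / len(grid)] * len(grid)
k1, n1, k2, n2 = 2, 5, 6, 15  # hypothetical data from two experiments

# Single update with the pooled data.
single = update(uniform, grid, k1 + k2, n1 + n2)
# Iterative update: the first posterior becomes the prior of the second step.
iterative = update(update(uniform, grid, k1, n1), grid, k2, n2)

max_gap = max(abs(a - b) for a, b in zip(single, iterative))
print("largest difference between the two posteriors:", max_gap)
```

The two results differ only by floating-point rounding.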

### Iterative belief updates and conjugate priors

In many cases, the observations arrive in small packets and we need to apply the Bayes rule

\begin{align}
p[\text{Parameters}|\text{Data}]\propto p[\text{Data}|\text{Parameters}]\cdot p[\text{Parameters}]
\end{align}

several times. On the right-hand side we have a product of two distributions. Normally, each of them is determined by a small set of parameters:

\begin{align}
p[\text{Parameters}|\text{Data}]\propto p[\text{Data}|\boldsymbol{\alpha}]\cdot p[\text{Parameters}|\boldsymbol{\beta}]\enspace.
\end{align}

For coin flipping, $p[\text{Data}|\boldsymbol{\alpha}]$ is a binomial distribution and $p[\text{Parameters}|\boldsymbol{\beta}]$ is a beta distribution.

In general, it is not guaranteed that the resulting posterior distribution is a nice parametric distribution, and we may need to do a lot of work to get the (approximate) end result.

**Ideally**, the posterior distribution comes from the same parametric distribution class as the prior. If this is the case, we only need to define the update formulae for the posterior distribution parameters $\boldsymbol{\gamma}$ in terms of the likelihood and prior parameters:

\begin{align}
\boldsymbol{\gamma}=f(\boldsymbol{\alpha}, \boldsymbol{\beta}).
\end{align}

When this ideal holds, one says that the prior distribution is **conjugate** to the likelihood distribution. By our results, the beta distribution is a conjugate prior to the binomial distribution.
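For the coin-flipping model the update rule reduces to simple arithmetic on the beta parameters. The sketch below uses the classical $(\alpha,\beta)$ parametrisation of the beta distribution from Section VI, which is not the same as the generic symbols $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ above; the function name is an assumption.

```python
def beta_binomial_update(alpha, beta, k, n):
    """Posterior beta parameters after observing k heads out of n throws.

    Prior density:     p^(alpha-1) * (1-p)^(beta-1)
    Likelihood kernel: p^k * (1-p)^(n-k)
    Their product is again a beta kernel with the parameters returned here.
    """
    return alpha + k, beta + (n - k)

# The uninformed (uniform) prior corresponds to alpha = beta = 1.
prior = (1, 1)
k1, n1, k2, n2 = 2, 5, 6, 15  # hypothetical data from two experiments

step_one = beta_binomial_update(*prior, k1, n1)
step_two = beta_binomial_update(*step_one, k2, n2)
pooled = beta_binomial_update(*prior, k1 + k2, n1 + n2)

print("after first packet: ", step_one)
print("after second packet:", step_two)
print("pooled update:      ", pooled)
```

No integration or renormalisation is needed, which is the practical appeal of conjugate priors.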







### Laplace smoothing

An uninformed prior is not good for estimating the coin bias in practice, as the maximum a posteriori estimate

\begin{align}
p=\frac{k}{n}
\end{align}

can be zero or one. Zero-one estimates are bad as they make you overconfident. In practice, they crash other inference algorithms by causing division-by-zero errors.

The problem lies in the fact that it is possible to observe exactly $0$ or $n$ heads in the coin-tossing experiment.
To circumvent this, we could add virtual observations. We could pretend that prior to the actual experiment we saw two coin flips: one head and one tail.

More formally, we fix a beta distribution with density proportional to $p(1-p)$ as the prior distribution. As the beta distribution is a conjugate prior to the binomial distribution, we get a beta distribution as the posterior with

\begin{align}
p=\frac{k+1}{n+2}
\end{align}

as the maximum a posteriori estimate. This estimate always lies strictly between zero and one and thus does not create numerical problems.


The choice of the virtual observations is subjective and we can use any values. In general, Laplace smoothing with $\alpha$ virtual heads and $\alpha$ virtual tails is given by the formula

\begin{align}
p=\frac{k+\alpha}{n+2\alpha}\enspace.
\end{align}
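The smoothing formula is a one-liner in code. The function name `laplace_estimate` and the example counts below are illustrative assumptions.

```python
def laplace_estimate(k, n, alpha=1.0):
    """Smoothed bias estimate with alpha virtual heads and alpha virtual tails."""
    return (k + alpha) / (n + 2 * alpha)

n = 10  # hypothetical number of throws
for k in (0, 3, 10):
    print(f"k = {k:2d}  k/n = {k / n:.3f}  smoothed = {laplace_estimate(k, n):.3f}")
```

With `alpha=0` the function falls back to the unsmoothed estimate $k/n$, and larger values of `alpha` pull the estimate more strongly towards $1/2$.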




