# Introduction to Bayesian Inference
In this Jupyter notebook, our goal is to present the basics of Bayesian inference and to see how we can apply it to different situations. This notebook is only a small introduction: I still have a lot to learn in the field, so this presentation is also a way for me to make sure that I have understood the principles.

## Bayesian statistics 
Bayesian statistics aims at computing the probability of different parameter values in light of some observations: we measure the impact of the observed data on our assumptions about these parameters.

Bayesian statistics relies mostly on Bayes' formula:
$$ 
P(A|B)=\frac{P(B|A)P(A)}{P(B)}
$$
This formula means that observing $B$ gives us some information about the probability of $A$, provided we know how likely $B$ is when $A$ holds, $P(B|A)$, as well as the overall probabilities $P(A)$ and $P(B)$.
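To make the formula concrete, here is a minimal sketch with made-up numbers. The three input probabilities are illustrative assumptions, not values taken from any data set, and $P(B)$ is obtained from the law of total probability.

```python
# Hypothetical inputs, chosen only to illustrate Bayes' formula.
p_a = 0.01              # P(A): prior probability of A
p_b_given_a = 0.90      # P(B|A): probability of observing B when A holds
p_b_given_not_a = 0.05  # P(B|not A): probability of observing B otherwise

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)

# Bayes' formula: P(A|B) = P(B|A)P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print(f"P(B)   = {p_b:.4f}")
print(f"P(A|B) = {p_a_given_b:.4f}")
```

Even though $B$ is much more likely when $A$ holds, $P(A|B)$ stays fairly small here because the prior $P(A)$ is small.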

If you are confused (I definitely was at this point), here is an example from Wikipedia (French page) that I suggest we work through:

### Laplace: sex ratio in population

In 1785, Laplace observed a discrepancy between the number of births of boys and girls, respectively $N_1 = 251\,527$ and $N_2 = 241\,945$. He wanted to determine whether this difference reveals a biased probability $\theta$ of having a boy.

Assuming nothing about $\theta$ (i.e. $\theta$ follows a uniform distribution between 0 and 1), he wanted to compute

$$
P(\theta \leq \frac{1}{2}|(N_1,N_2))
$$

Using Bayes' formula:
$$
P(\theta|(N_1,N_2))=\frac{P((N_1,N_2)|\theta)P(\theta)}{P((N_1,N_2))}
$$

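The easiest term in the previous equation is $P(\theta)$: since we assumed that $\theta$ follows a uniform distribution between 0 and 1, its density is $P(\theta)=\frac{1}{1-0}=1$.

$P((N_1,N_2)|\theta)$ is the probability of one particular sequence of $N_1$ independent draws with probability $\theta$ and $N_2$ draws with probability $(1-\theta)$. (Counting every possible ordering would only multiply it by a binomial coefficient that does not depend on $\theta$ and cancels out in Bayes' formula.) Therefore: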
$$
P((N_1,N_2)|\theta)=\theta^{N_1}(1-\theta)^{N_2}
$$

$P((N_1,N_2))$ is obtained by integrating $P((N_1,N_2)|\theta)P(\theta)$ over all possible values of $\theta$. Since $P(\theta)=1$, this gives:

$$
P((N_1,N_2))=\int_0^1 P((N_1,N_2)|\theta)d\theta = \int_0^1 \theta^{N_1}(1-\theta)^{N_2} d\theta = B(N_1+1,N_2+1)
$$

where $B$ is the [beta function](https://en.wikipedia.org/wiki/Beta_function).
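As a quick sanity check, the identity between the integral and the beta function can be verified numerically. The sketch below uses small hypothetical counts (`n1`, `n2`) for the comparison, because with Laplace's counts the value of $B(N_1+1,N_2+1)$ is far below the smallest positive floating-point number. For the real counts it only evaluates the logarithm, using `scipy.special.betaln`.

```python
from scipy.integrate import quad
from scipy.special import beta as beta_fn, betaln

# Hypothetical small counts, used only to compare the two expressions.
n1, n2 = 7, 3

# Left-hand side: numerical integration of theta^n1 * (1 - theta)^n2 on [0, 1].
integral, _ = quad(lambda t: t**n1 * (1.0 - t)**n2, 0.0, 1.0)

# Right-hand side: closed form B(n1 + 1, n2 + 1).
closed_form = beta_fn(n1 + 1, n2 + 1)
print(integral, closed_form)

# With Laplace's counts the normalising constant is astronomically small,
# so we work with its logarithm instead of the value itself.
N1, N2 = 251527, 241945
log_evidence = betaln(N1 + 1, N2 + 1)
print(log_evidence)
```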

In the end we have:

$$
P(\theta|(N_1,N_2))=\frac{\theta^{N_1}(1-\theta)^{N_2}}{B(N_1+1,N_2+1)}
$$

By definition:

$$
P(\theta \leq \frac{1}{2}|(N_1,N_2))= \int_0^{\frac{1}{2}} P(\theta|(N_1,N_2))d\theta =  \frac{\int_0^{\frac{1}{2}}\theta^{N_1}(1-\theta)^{N_2}d\theta}{B(N_1+1,N_2+1)}= I_{\frac{1}{2}}(N_1+1,N_2+1)
$$

where $I_x(a,b)$ is the [regularized incomplete beta function](https://en.wikipedia.org/wiki/Beta_function#Incomplete_beta_function).

Luckily for us, this regularized incomplete beta function is implemented in SciPy:
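Before using the real counts, the ratio of integrals above can be checked on a toy case. The sketch below assumes small hypothetical counts (`n1`, `n2`) and compares two estimates of $P(\theta \leq \frac{1}{2})$: direct numerical integration of the posterior density, and a Monte Carlo estimate from samples of the posterior, which is a Beta distribution with parameters $(N_1+1, N_2+1)$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import beta as beta_fn

# Hypothetical small counts, used only for this check.
n1, n2 = 7, 3


def unnormalized_posterior(theta):
    """theta^n1 * (1 - theta)^n2, the numerator of the posterior density."""
    return theta**n1 * (1.0 - theta)**n2


# Estimate 1: integrate the numerator on [0, 1/2] and divide by B(n1+1, n2+1).
numerator, _ = quad(unnormalized_posterior, 0.0, 0.5)
prob_integral = numerator / beta_fn(n1 + 1, n2 + 1)

# Estimate 2: fraction of posterior samples that fall below 1/2.
rng = np.random.default_rng(0)
samples = rng.beta(n1 + 1, n2 + 1, size=200_000)
prob_monte_carlo = np.mean(samples <= 0.5)

print(prob_integral, prob_monte_carlo)
```

The two numbers should agree up to Monte Carlo noise. Sampling is not a practical way to estimate a probability as small as the one in Laplace's problem, which is why a closed form is so convenient there.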

In [5]:
# scipy.special.betainc is the *regularized* incomplete beta function I_x(a, b),
# not the plain incomplete beta function: it is already normalised by B(a, b),
# computed as a ratio of gamma functions.
# See the SciPy docs: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.special.betainc.html
from scipy.special import betainc

N1 = 251527  # observed births of boys
N2 = 241945  # observed births of girls
x = 0.5

# Posterior probability P(theta <= 1/2 | N1, N2) = I_{1/2}(N1 + 1, N2 + 1)
betainc(N1 + 1, N2 + 1, x)

1.1460584897220391e-42

In light of these observations, the probability that the proportion of boys at birth is at most 0.5 is about $1.15 \times 10^{-42}$. We can call it unlikely.
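To get a feeling for why this probability is so tiny, we can summarise the posterior, which is a Beta distribution with parameters $(N_1+1, N_2+1)$. The sketch below is one possible way to do it with `scipy.stats.beta`. The variable names and the choice of summary (mean, standard deviation, and the distance of 0.5 from the mean in standard deviations) are mine.

```python
from scipy.stats import beta as beta_dist

N1 = 251527  # observed births of boys
N2 = 241945  # observed births of girls

# Posterior of theta under a uniform prior: Beta(N1 + 1, N2 + 1).
posterior = beta_dist(N1 + 1, N2 + 1)

post_mean = posterior.mean()
post_std = posterior.std()

# How many posterior standard deviations lie between 0.5 and the mean.
z_score = (post_mean - 0.5) / post_std

print(f"posterior mean: {post_mean:.6f}")
print(f"posterior std : {post_std:.6f}")
print(f"0.5 is {z_score:.1f} standard deviations below the mean")
```

With almost half a million births the posterior is extremely narrow, so a value of $\theta$ that looks close to the mean on an absolute scale is in fact very far in the tail.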

## Bayesian inference
In machine learning, we often face the problem of finding the "best" set of parameters of a parametric function, for instance the one that minimizes the mean squared error. However, this optimum is not necessarily the most robust set of parameters. Let $\theta \in \mathbb R^N$ be the vector of parameters that we would like to find, and $D=(X,Y)$ our data.
We would like to have the following quantity:
$$
P(\theta|D)
$$
that is, the probability of our parameters once we have observed the data.


Using Bayes' formula:
$$
P(\theta|D)=\frac{P(D|\theta)P(\theta)}{P(D)}
$$

$P(D|\theta)$ is the likelihood of the data, $P(\theta)$ is our prior belief about the values of the parameters, and $P(D)$ is the likelihood weighted by the prior and integrated over all parameter values: $P(D)=\int P(D|\theta)P(\theta)d\theta$.

Of course, if $N$ is small (say $N=2$) the integral in the denominator is quite easy to compute, but it becomes much more difficult for a high-dimensional $\theta$. However, the denominator does not depend on $\theta$, therefore we can write:

$$
P(\theta|D) \propto P(D|\theta)P(\theta)
$$

And you'll see: that's just about what we need!
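Here is a minimal sketch of what this proportionality buys us, going back to the one-parameter boy/girl model with small hypothetical counts (`n1`, `n2`) and a uniform prior. On a grid of $\theta$ values, the most probable parameter can be found from the unnormalized product alone, and the normalizing constant can be recovered afterwards by a simple sum. The grid size and variable names are arbitrary choices for illustration.

```python
import numpy as np
from scipy.special import beta as beta_fn

# Hypothetical small counts and a uniform prior on [0, 1].
n1, n2 = 7, 3
theta = np.linspace(0.0, 1.0, 2001)
d_theta = theta[1] - theta[0]

prior = np.ones_like(theta)
likelihood = theta**n1 * (1.0 - theta)**n2

# P(theta | D) is proportional to P(D | theta) P(theta): no P(D) needed here.
unnormalized = likelihood * prior

# The most probable theta does not depend on the normalizing constant.
theta_map = theta[np.argmax(unnormalized)]

# If the full density is needed, normalize numerically on the grid...
posterior_grid = unnormalized / (unnormalized.sum() * d_theta)

# ...and compare with the exact posterior, available in this simple case.
posterior_exact = likelihood / beta_fn(n1 + 1, n2 + 1)
max_abs_error = np.max(np.abs(posterior_grid - posterior_exact))

print(theta_map, max_abs_error)
```

A grid only works for a handful of parameters, since the number of grid points grows exponentially with the dimension of $\theta$.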
