# Exercise 8 - Basic Samplers

In this exercise, we will build samplers to generate samples from categorical and Gaussian distributions.

In the event of a persistent problem, do not hesitate to contact the course instructors at
- paul.kahlmeyer@uni-jena.de

### Submission

- Deadline of submission: 08.01.2023
- Submission on [moodle page](https://moodle.uni-jena.de/course/view.php?id=34630)

### Help
In case you cannot solve a task, you can use the saved values in the `help` directory to continue working on the other tasks:
- Load arrays with [Numpy](https://numpy.org/doc/stable/reference/generated/numpy.load.html)
```
import numpy as np
np.load('help/array_name.npy')
```
- Load functions, classes and other objects with [Dill](https://dill.readthedocs.io/en/latest/dill.html)
```
import dill
with open('help/some_func.pkl', 'rb') as f:
    func = dill.load(f)
```

# Sampling from Distributions

Sampling from a distribution is usually done by mapping the unit interval $[0, 1]$ onto the sample space of the distribution. Given a sampler on the unit interval, we can then map its samples onto the target sample space.

How to sample from $[0,1]$ was the content of the last exercise. In this exercise we will leave sampling from $[0, 1]$ to [Numpy](https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.html).

## Categorical

Transforming samples from $[0, 1]$ into categories of a categorical with probability vector $p$ is done with **Inverse Transform Sampling**:

We split the interval into bins of size $p(x=i)$ for each category $i$.
Then we can assign each $u\in[0, 1]$ to the category of the bin it falls into. 

The following figure illustrates this process for 4 categories. The blue balls are samples in $[0, 1]$; each falls into a bin whose width equals the probability of the respective category.
<div>
<img src="images/its.png" width="600"/>
</div>
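
As a small worked example of the binning described above, the following sketch maps four fixed values of $u$ to categories. The probability vector and the values of `u` are made up for illustration, and `np.searchsorted` on the cumulative sums is one possible way to find the bin, not necessarily the intended solution of the task.

```python
import numpy as np

# hypothetical probability vector with 4 categories
p = np.array([0.1, 0.2, 0.3, 0.4])

# upper edges of the bins on the unit interval: approximately [0.1, 0.3, 0.6, 1.0]
edges = np.cumsum(p)

# fixed values in [0, 1), one per bin
u = np.array([0.05, 0.25, 0.59, 0.95])

# index of the bin that each u falls into
categories = np.searchsorted(edges, u, side="right")
print(categories)  # [0 1 2 3]
```

A value such as `0.25` lies between the edges `0.1` and `0.3` and is therefore assigned to category 1.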



### Task 1

Implement the inverse transform sampling for categoricals.

Use this function to sample from a categorical with 
\begin{equation}
p = [0.1, 0.2 , 0.1, 0.15 , 0.13, 0.32]^T
\end{equation}

and compare the true $p$ to the maximum likelihood estimate based on an increasing number of samples over 10 tries.

For visualization you can use the calculation of a mean confidence interval in `utils.py`.

In [1]:
import numpy as np
import utils

def sample_categorical(p: np.ndarray, N: int = 1) -> np.ndarray:
    '''
    Samples from a categorical using inverse transform sampling.
    
    @Params:
        p... probability vector
        N... number of samples
        
    @Returns:
        samples from a categorical distribution
    '''
    # upper edges of the bins: p_cumulative[i] = p[0] + ... + p[i]
    p_cumulative = np.cumsum(p)
    unit_samples = np.random.random_sample(N)
    # index of the bin each unit sample falls into
    samples = np.searchsorted(p_cumulative, unit_samples, side='right')
    # rounding can leave p_cumulative[-1] slightly below 1; clip so that every sample gets a category
    return np.minimum(samples, p.size - 1)


p_true = np.array([0.1, 0.2 , 0.1, 0.15 , 0.13, 0.32])
num_samples = [5, 10, 100, 500, 1000, 5000, 10000, 50000, 100000, 1000000]
print(f'True p: {p_true}')
for n in num_samples:
    samples = sample_categorical(p_true, n)
    p_estimate = np.histogram(samples, range(6+1))[0] / n
    print(f'Estimated p with {n} samples: {p_estimate}')

True p: [0.1  0.2  0.1  0.15 0.13 0.32]
Estimated p with 5 samples: [0.  0.4 0.  0.4 0.  0.2]
Estimated p with 10 samples: [0.  0.1 0.2 0.  0.2 0.5]
Estimated p with 100 samples: [0.14 0.22 0.03 0.13 0.14 0.34]
Estimated p with 500 samples: [0.112 0.2   0.09  0.142 0.158 0.298]
Estimated p with 1000 samples: [0.09  0.2   0.094 0.178 0.117 0.321]
Estimated p with 5000 samples: [0.1024 0.1904 0.1032 0.152  0.132  0.32  ]
Estimated p with 10000 samples: [0.1055 0.1984 0.0906 0.1557 0.1297 0.3201]
Estimated p with 50000 samples: [0.09784 0.20014 0.10148 0.14954 0.13258 0.31842]
Estimated p with 100000 samples: [0.09878 0.19987 0.09976 0.15067 0.12932 0.3216 ]
Estimated p with 1000000 samples: [0.100032 0.200124 0.099985 0.149784 0.130162 0.319913]
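
The task also asks for a comparison over 10 tries. A minimal sketch of that repetition is given below. It assumes NumPy's `Generator.choice` as a stand-in for `sample_categorical` and uses the largest absolute deviation from the true $p$ as an illustrative error measure; both are assumptions, and the helper in `utils.py` is not used because its interface is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = np.array([0.1, 0.2, 0.1, 0.15, 0.13, 0.32])
num_tries = 10

def mle_error(n: int) -> tuple:
    """Mean and standard deviation of the estimation error over num_tries runs."""
    errors = []
    for _ in range(num_tries):
        # stand-in sampler; replace with sample_categorical(p_true, n)
        samples = rng.choice(p_true.size, size=n, p=p_true)
        # maximum likelihood estimate: relative frequency of each category
        p_hat = np.bincount(samples, minlength=p_true.size) / n
        errors.append(np.abs(p_hat - p_true).max())
    return float(np.mean(errors)), float(np.std(errors))

results = {n: mle_error(n) for n in [10, 1000, 100000]}
for n, (mean_err, std_err) in results.items():
    print(f'{n} samples: error {mean_err:.4f} +- {std_err:.4f}')
```

The mean error should shrink as the number of samples grows, which is the behaviour the single-run output above already suggests.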


## Standard Normal Distribution

The [Box-Muller transform](https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform) is a method to transform a sample $(u, v) \in[0, 1]^2$ into two samples $x_0, x_1$ from a standard normal distribution $\mathcal{N}(0, 1)$. 

This is done by treating $u$ and $v$ as probabilities that determine a point $(x_0, x_1)$ of a bivariate standard normal distribution through its angle $\theta$ and its distance $r$ from the origin. 

<div>
<img src="images/boxmuller.png" width="200"/>
</div>

We transform $u$ and $v$ to actual $\hat{\theta}$ and $\hat{r}$ by using the [inverse cumulative distribution function](https://en.wikipedia.org/wiki/Quantile_function):

1. The angle $\theta$ is uniformly distributed in $[0, 2\pi]$.\
We transform $u$ to $\hat{\theta}$ using  $p\left(\theta < \hat{\theta}\right) = \text{cdf}(\hat{\theta})= u$.
2. The half squared distance $\frac{r^2}{2}$ from the origin is exponentially distributed with $\lambda = 1$.\
We transform $v$ to $\frac{\hat{r}^2}{2}$ using $p\left(\frac{r^2}{2} < \frac{\hat{r}^2}{2}\right) = \text{cdf}\left(\frac{\hat{r}^2}{2}\right)= v$.


### Task 2

Derive the formulas to calculate $\hat{\theta}$ and $\hat{r}$ from $u$ and $v$.

Based on that, calculate $x_0$ and $x_1$ from $u = 0.5, v = 0.2$.

#### Formula for $\hat{\theta}$
$
\begin{align*}
    u &= p\left(\theta < \hat{\theta}\right) \\
    u &= \int_0^{\hat{\theta}} \frac{1}{2\pi} d\theta \\
    u &= \frac{\hat{\theta}}{2\pi} \\
    \hat{\theta} &= 2u\pi
\end{align*}
$

#### Formula for $\hat{r}$
The cdf of the exponential distribution is $\text{cdf}(x) = 1 - e^{-\lambda x}$. With $\lambda = 1$ and $x = \frac{\hat{r}^2}{2}$ we get:

$
\begin{align*}
    v &= p\left(\frac{r^2}{2} < \frac{\hat{r}^2}{2} \right) \\
    v &= 1 - e^{-\frac{\hat{r}^2}{2}} \\
    e^{-\frac{\hat{r}^2}{2}} &= 1 - v \\
    -\frac{\hat{r}^2}{2} &= \log(1-v) \\
    \hat{r} &= \sqrt{-2 \log(1-v)}
\end{align*}
$
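
As a quick sanity check of the two derived formulas, the following sketch plugs $\hat{\theta}$ and $\hat{r}$ back into the respective cdfs and recovers $u$ and $v$. The variable names `u_back` and `v_back` are introduced here only for illustration.

```python
import math

u, v = 0.5, 0.2

# derived inverse transforms
theta_hat = 2 * math.pi * u
r_hat = math.sqrt(-2 * math.log(1 - v))

# plugging back into the cdfs should recover u and v
u_back = theta_hat / (2 * math.pi)        # cdf of the uniform distribution on [0, 2*pi]
v_back = 1 - math.exp(-r_hat ** 2 / 2)    # cdf of the exponential distribution with lambda = 1

print(u_back, v_back)
```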

In [2]:
import math
u, v = 0.5, 0.2
theta = 2 * u * math.pi
r = math.sqrt(-2* math.log(1-v))
x0 = r * math.cos(theta)
x1 = r * math.sin(theta)
x0,x1

(-0.6680472308365775, 8.181219029232676e-17)

### Task 3

Implement the sampling process for the standard normal distribution.

Draw 10000 samples and check the hypothesis that they are distributed according to a standard normal distribution using the [Kolmogorov-Smirnov test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html).

We choose a confidence level of 95% (a significance level of 5%); that is, we reject the null hypothesis (our data follows a standard normal distribution) in favor of the alternative if the p-value is less than 0.05.
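
To illustrate the decision rule, the following sketch applies the test to data that is clearly not standard normal, namely uniform samples on $[0, 1)$. The seed, the sample size and the variable names are arbitrary choices made for this illustration.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
alpha = 0.05  # significance level

# uniform samples on [0, 1) tested against the standard normal cdf
uniform_result = kstest(rng.random(1000), "norm")

# reject the null hypothesis if the p-value is below the significance level
reject_uniform = uniform_result.pvalue < alpha
print(uniform_result.statistic, uniform_result.pvalue, reject_uniform)
```

Here the null hypothesis is rejected. Note that the converse does not hold: a p-value above 0.05 only means the test found no evidence against the null hypothesis, not that the data is proven to be normal.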

In [3]:
from scipy.stats import kstest, norm
def sample_standard_normal(N:int=1) -> np.ndarray:
    '''
    Samples from a univariate standard normal distribution using Box-Muller transform.
    
    @Params:
        N... number of samples
        
    @Returns:
        Samples from a standard normal distribution
    '''
    unit_samples_u = np.random.random_sample(N)
    unit_samples_v = np.random.random_sample(N)
    normal_samples = []
    for (u,v) in zip(unit_samples_u, unit_samples_v):
        theta = 2 * u * math.pi
        r = math.sqrt(-2* math.log(1-v))
        sample = r * math.cos(theta)
        normal_samples.append(sample)
    return np.array(normal_samples)

kstest(sample_standard_normal(10000), norm.cdf)
# p-value > 0.05 => we cannot reject the hypothesis that the data is standard normal distributed

KstestResult(statistic=0.008011850160992151, pvalue=0.5395197849329181)
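
The implementation above keeps only the cosine coordinate. A vectorised sketch that returns both coordinates of each Box-Muller pair is shown below; it assumes NumPy's `default_rng` as the source of uniform samples, and the function name `box_muller_pairs` is made up for this illustration.

```python
import numpy as np

def box_muller_pairs(n: int, rng: np.random.Generator) -> tuple:
    """Returns two arrays of n standard normal samples each."""
    u = rng.random(n)  # uniform in [0, 1)
    v = rng.random(n)  # uniform in [0, 1), so 1 - v lies in (0, 1]
    theta = 2.0 * np.pi * u
    # log1p(-v) equals log(1 - v)
    r = np.sqrt(-2.0 * np.log1p(-v))
    return r * np.cos(theta), r * np.sin(theta)

x0, x1 = box_muller_pairs(100000, np.random.default_rng(0))
print(x0.mean(), x0.std(), x1.mean(), x1.std())
print(np.corrcoef(x0, x1)[0, 1])
```

Both coordinates should have mean close to 0 and standard deviation close to 1, and their correlation should be close to 0, since the two outputs of the transform are independent.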

## Arbitrary Gaussian
Now we want to sample from an arbitrary multivariate Gaussian with mean vector $\mu$ and covariance matrix $\Sigma$.


Let $X = (X_1,\dots, X_n)$ be a vector of random variables $X_i$, whose joint density is a **standard** multivariate Gaussian. That is $X\sim\mathcal{N}(0, \mathbb{1}_n)$.

Then for a given mean vector $\mu$ and a covariance matrix with [Cholesky decomposition](https://numpy.org/doc/stable/reference/generated/numpy.linalg.cholesky.html) $\Sigma = AA^T$ we can transform $X$ into 
\begin{equation}
Y = AX + \mu
\end{equation}
and $Y\sim\mathcal{N}(\mu, \Sigma)$.

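
The reason this works is that an affine map of a Gaussian is again Gaussian, with $\mathbb{E}[Y] = A\,\mathbb{E}[X] + \mu = \mu$ and $\text{Cov}(Y) = A\,\text{Cov}(X)\,A^T = AA^T = \Sigma$. The following sketch checks this numerically. It uses NumPy's `standard_normal` as a stand-in for the Box-Muller sampler, and the values of `mu` and `Sigma` are example values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([5.0, 2.0])
Sigma = np.array([[2.5, 1.65], [1.65, 1.93]])

# lower-triangular Cholesky factor with A @ A.T == Sigma
A = np.linalg.cholesky(Sigma)

# each column of X is one sample from the standard bivariate Gaussian
X = rng.standard_normal((2, 200_000))

# affine transform applied to every column
Y = A @ X + mu[:, None]

print(np.allclose(A @ A.T, Sigma))
print(Y.mean(axis=1))
print(np.cov(Y))
```

The empirical mean and covariance of `Y` should be close to `mu` and `Sigma`.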
### Task 4

Implement the sampling routine to sample from a multivariate Gaussian with given mean and covariance matrix.

Similar to Task 1, use this function to sample from a Gaussian with 
\begin{align}
\mu &= [5, 2]^T\\
\Sigma &= \begin{bmatrix}
2.5&1.65\\
1.65&1.93
\end{bmatrix}
\end{align}

and compare the true $\mu$ and $\Sigma$ to the maximum likelihood estimates based on an increasing number of samples over 10 tries.

For visualization you can use the calculation of a mean confidence interval in `utils.py`.

In [6]:
def sample_normal(mean:np.ndarray, cov:np.ndarray, N:int=1)-> np.ndarray:
    '''
    Samples from a multivariate normal distribution.
    
    @Params:
        mean... mean vector of gaussian
        cov... covariance matrix
        N... number of samples
        
    @Returns:
        samples from an arbitrary normal distribution, shape (num_dim, N)
    '''
    num_dim = mean.shape[0]
    # one row of standard normal samples per dimension; also 2D for num_dim = 1
    standard_samples = np.vstack([sample_standard_normal(N) for _ in range(num_dim)])
    # Cholesky factor A with cov = A @ A.T
    A = np.linalg.cholesky(cov)
    # transform each column x into A x + mean
    return A @ standard_samples + np.atleast_2d(mean).T


mean_true = np.array([5, 2])
cov_true = np.array([[2.5 , 1.65], [1.65, 1.93]])
num_samples = [5, 10, 100, 500, 1000, 5000, 10000, 50000, 100000, 1000000]
print(f'True mean: {mean_true}')
print(f'True covariance: {cov_true}')
print()
for n in num_samples:
    print(f'{n} samples:')
    samples = sample_normal(mean_true, cov_true, n)
    mean_estimate = np.mean(samples, axis=1)
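    # np.cov normalises by N-1 by default; pass bias=True for the maximum likelihood estimate (normalised by N)
    cov_estimate = np.cov(samples)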
    print(f'Estimated mean: {mean_estimate}')
    print(f'Estimated covariance: {cov_estimate}')
    print()

True mean: [5 2]
True covariance: [[2.5  1.65]
 [1.65 1.93]]

5 samples:
Estimated mean: [4.64010487 1.911497  ]
Estimated covariance: [[1.75288859 1.45266739]
 [1.45266739 2.02516743]]

10 samples:
Estimated mean: [4.56563275 2.27080314]
Estimated covariance: [[0.713159   0.94898761]
 [0.94898761 1.47254133]]

100 samples:
Estimated mean: [4.96045534 2.13732045]
Estimated covariance: [[2.80811182 2.01871447]
 [2.01871447 2.3630396 ]]

500 samples:
Estimated mean: [5.01752009 2.09367589]
Estimated covariance: [[2.51780755 1.67742278]
 [1.67742278 1.99091663]]

1000 samples:
Estimated mean: [5.04082347 2.04953006]
Estimated covariance: [[2.39421258 1.48347235]
 [1.48347235 1.76625033]]

5000 samples:
Estimated mean: [4.97621294 1.98308257]
Estimated covariance: [[2.49590005 1.66967687]
 [1.66967687 1.94552181]]

10000 samples:
Estimated mean: [5.0137724  2.02189727]
Estimated covariance: [[2.50908305 1.66376901]
 [1.66376901 1.92218696]]

50000 samples:
Estimated mean: [5.00138155 1.997