# Important note!

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your GT login and the GT logins of any of your collaborators below. (The GT logins are worth 1 point per notebook, so don't miss the opportunity to get a free point!)

In [None]:
YOUR_ID = "" # Please enter your GT login, e.g., "rvuduc3" or "gtg911x"
COLLABORATORS = [] # list of strings of your collaborators' IDs

In [None]:
import re

RE_CHECK_ID = re.compile (r'''[a-zA-Z]+\d+|[gG][tT][gG]\d+[a-zA-Z]''')
assert RE_CHECK_ID.match (YOUR_ID) is not None

collab_check = [RE_CHECK_ID.match (i) is not None for i in COLLABORATORS]
assert all (collab_check)

del collab_check
del RE_CHECK_ID
del re

**Jupyter / IPython version check.** The following code cell verifies that you are using the correct version of Jupyter/IPython.

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

# Output analysis

You've built a simulator, which is running correctly and producing outputs. Now what?

Some handy resources for today's notebook:
* For a discussion of output analysis, see Chapter 8 of Leemis & Park (2006): https://t-square.gatech.edu/access/content/group/gtc-239f-fc11-5690-9dae-2dc96b59f372/Lemmis-Park-2006--des-first-course.pdf
* For a list of Numpy's random number generators, see: http://docs.scipy.org/doc/numpy/reference/routines.random.html
* For advice and code on making histograms, see: http://matplotlib.org/1.2.1/examples/pylab_examples/histogram_demo.html

In [None]:
import numpy as np
from math import sqrt

np.set_printoptions (linewidth=100, precision=2)

In [None]:
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab

%matplotlib inline

from ipywidgets import interact

## Central Limit Theorem

Let's start by checking the Central Limit Theorem (CLT) experimentally.

Suppose we have built a simulator that produces a single output value each time it runs. By the argument from class, we may assume that the output of run $i$ is a random variable $Y_i$, where the random variables $\{Y_i\}$ are independent and identically distributed with some true underlying mean $\mu$ and variance $\sigma^2$.

Here is a hypothetical simulator that obeys these assumptions, at least approximately. In particular, it simply draws output values from an exponential distribution with mean $\mu$, i.e., $Y_i \sim \mathcal{E}(\mu)$. As it happens, the variance of such a distribution is $\sigma^2 = \mu^2$.

In [None]:
MU_TRUE = 1.0

def fake_simulator (mu=MU_TRUE):
    """
    Pretends to simulate some process that produces
    a single output value.
    """
    return np.random.exponential (mu)

VAR_TRUE = MU_TRUE * MU_TRUE

# Demo
fake_simulator ()
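
As a quick sanity check on the claim that an exponential distribution with mean $\mu$ has variance $\mu^2$, here is a minimal sketch that draws many samples at once and compares the sample statistics to those values. The seeded generator `rng` and the value `mu = 2.0` are assumptions chosen only for illustration; they are not part of the lab's simulator.

```python
import numpy as np

# Assumption: a seeded generator, used only so the check is repeatable.
rng = np.random.default_rng(12345)

mu = 2.0                                      # hypothetical mean, for illustration
samples = rng.exponential(mu, size=200_000)   # many draws in a single call

sample_mean = samples.mean()
sample_var = samples.var()

print("sample mean     =", sample_mean, "(expected about", mu, ")")
print("sample variance =", sample_var, "(expected about", mu * mu, ")")
```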

Suppose you run the simulator $n$ times. The mean of these $n$ runs is another random variable, $\bar{Y}_n$, where

$$\begin{eqnarray}
  \bar{Y}_n & \equiv & \dfrac{1}{n} \sum_{i=0}^{n-1} Y_i.
\end{eqnarray}$$

The _Central Limit Theorem_ tells us that the mean will be distributed normally (i.e., as a Gaussian) with mean and variance given by

$$\begin{eqnarray}
  \bar{Y}_n & \sim & \mathcal{N}\left(\mu, \dfrac{\sigma^2}{n}\right)
\end{eqnarray}$$

as the number of samples $n \rightarrow \infty$. In other words, the mean of $n$ runs will tend toward the true mean with less and less uncertainty as $n$ increases.
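
To see the shrinking uncertainty numerically, the following sketch estimates the variance of the sample mean for several values of $n$ and compares it with $\sigma^2 / n$. It samples directly from NumPy rather than calling the lab's functions; the seeded generator `rng`, the batch count `num_batches`, and the dictionary `var_of_means` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2017)   # assumption: seeded only for repeatability

mu = 1.0                 # true mean of the exponential "simulator"
sigma2 = mu * mu         # true variance of an exponential with mean mu
num_batches = 5000       # hypothetical number of batches

var_of_means = {}
for n in [10, 100, 1000]:
    # Each row is one batch of n runs; average across each row.
    batch_means = rng.exponential(mu, size=(num_batches, n)).mean(axis=1)
    var_of_means[n] = batch_means.var()
    print(n, "=> observed", var_of_means[n], "vs. sigma^2 / n =", sigma2 / n)
```

Each tenfold increase in $n$ should reduce the observed variance of the batch means by roughly a factor of ten.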

**Exercise 1** (2 points). Complete the following function, which conducts a given number of "experiments," where each experiment is a single run of a given simulator. It returns an array containing the outputs of all these experiments.

In [None]:
def do_experiments (simulator, num_experiments):
    """
    This function repeatedly calls a simulator and records the outputs.
    The simulator must be a function, `simulator()`, that returns a
    single floating-point output value. This function will call the
    simulator `num_experiments` times and return all outputs.
    """
    assert callable (simulator) # `simulator` must be callable, e.g., a function
    Y = np.zeros (num_experiments)
    
    # Run the given simulator and record outputs in Y[:]
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return Y

# Demo
n_e = 10000
Y = do_experiments (fake_simulator, n_e)
print ("n_e =", n_e, "==>", np.mean (Y))

In [None]:
assert abs (np.mean (Y) - MU_TRUE) / MU_TRUE <= 0.03

**Exercise 2** (2 points). Complete the following function, which runs batches of experiments. Each batch consists of running the simulator a given number of times; the number of batches is also given. For each batch, your implementation should record the mean of the simulator runs. It should then return all of those means.

In [None]:
def repeat_experiments (simulator, num_experiments, num_batches):
    """
    This function repeats a batch of simulation experiments many times,
    returning the means of each batch.
    
    It uses `do_experiments()` to run one batch of experiments, and
    repeats batch runs `num_batches` times.
    """
    Y_bar = np.zeros (num_batches) # Stores the means of each batch
    
    # Run batches and record means
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return Y_bar

# Demo
n_b = 10 # Number of batches
NE = [10, 100, 1000, 10000]
YB = []
for n_e in NE:
    YB.append (repeat_experiments (fake_simulator, n_e, n_b))
    print (n_e, "=>", YB[-1])

In [None]:
assert (abs (YB[0] - MU_TRUE) / MU_TRUE <= 0.875).all ()
assert (abs (YB[1] - MU_TRUE) / MU_TRUE <= 0.375).all ()
assert (abs (YB[2] - MU_TRUE) / MU_TRUE <= 0.125).all ()
assert (abs (YB[3] - MU_TRUE) / MU_TRUE <= 0.0625).all ()

In [None]:
# Another demo, which plots the means of all batches for varying
# numbers of experimental trials per batch.

fig = plt.figure (figsize=(16, 6))
ax = fig.add_subplot (111)

n_b = 100 # Number of batches
x = np.arange (n_b)
for n_e in NE:
    y = repeat_experiments (fake_simulator, n_e, n_b)
    ax.plot (x, y, '*-', label=str (n_e))

# Label the axes once, after all curves have been drawn
ax.set_xlabel ('Batch number')
ax.set_title ('Sample mean during each batch')
ax.legend ()

**Exercise 3** (2 points). Create an interactive widget using `interact()` to verify the behavior of the CLT using the fake simulator (`fake_simulator()`). Your widget should include the following:

- It should allow the user to vary both the number of batches and the number of experiments per batch.
- It should draw a histogram of $\bar{y}$ from all the batches.
- It should add a curve for the theoretical normal distribution that we would expect the histogram to reflect.

Here is a sample of what your widget might look like.

![Sample widget.](https://github.com/rvuduc/cx4230sp17labs/raw/master/lab5/example-widget.png)

In [None]:
def viz_exp (num_experiments=100, num_repetitions=100):
    """
    Runs many batches of "fake" experiments. Plots a
    histogram and adds a best-fit Gaussian to the plot.
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    
# Demo
x = interact (viz_exp
              , num_experiments=(100, 2000, 100)
              , num_repetitions=(10, 100, 10)
             )

## $t$-test

Suppose you run the simulation $n$ times, observing the output values $y_0$, $y_1$, $\ldots$, $y_{n-1}$. From these observations you then compute the _sample mean_,

$$\begin{eqnarray}
  \bar{y}_n & \leftarrow & \dfrac{1}{n} \sum_{i=0}^{n-1} y_i.
\end{eqnarray}$$

Since you only have one realization of experiments (i.e., one set of observations), this sample mean is a _point estimate_. How close is this point estimate to the true mean?

Consider the following test statistic, sometimes called the _$t$-statistic_ and denoted $t_n$. It is defined in terms of the sample mean, $\bar{y}_n$, and the _sample variance_, $s_n^2$.

$$\begin{eqnarray}
  s_n^2 & \leftarrow & \dfrac{1}{n} \sum_{i=0}^{n-1} (y_i - \bar{y}_n)^2 \\
  t_n & \equiv & \dfrac{\bar{y}_n - \mu}{s_n \,/\, \sqrt{n-1}}.
\end{eqnarray}$$

Note that $t_n$ is not actually computable in general, as it depends on the _true_ mean, which you don't know. Nevertheless, and quite remarkably, the _distribution_ of $t_n$ _is_ known! In particular, when the observations are normally distributed (and approximately so otherwise, for sufficiently large $n$), $t_n$ follows [_Student's $t$-distribution_](http://mathworld.wolfram.com/Studentst-Distribution.html) with $n-1$ degrees of freedom.

$$\begin{eqnarray}
  t_n & \sim & \mathrm{Student}(n-1).
\end{eqnarray}$$

Moreover, the cumulative distribution function (CDF) of a Student-$t$ random variable is known. Let's call the CDF $F_n(x) \equiv \mathrm{Pr}[t_n \leq x]$. Then, it is possible to compute the probability that $t_n$ falls within some range.

For example, suppose we wish to know the probability that $t_n$ falls between $-x$ and $x$. In terms of the CDF,

$$\begin{eqnarray}
  \mathrm{Pr}[-x \leq t_n \leq x] & = & F_n(x) - F_n(-x).
\end{eqnarray}$$

As it happens, the density of Student's $t$-distribution is symmetric about 0. Therefore, $F_n(-x) = 1 - F_n(x)$ and

$$\begin{eqnarray}
  \mathrm{Pr}[-x \leq t_n \leq x] & = & 2 F_n(x) - 1.
\end{eqnarray}$$
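
The identity above can be checked numerically with SciPy's `t.cdf()`. This is a minimal sketch; the degrees of freedom `dof` and the test points in `x` are arbitrary values chosen for illustration.

```python
import numpy as np
from scipy.stats import t

dof = 9                                  # hypothetical degrees of freedom
x = np.array([0.5, 1.0, 1.833, 2.5])     # arbitrary test points

lhs = t.cdf(x, dof) - t.cdf(-x, dof)     # Pr[-x <= t <= x] directly from the CDF
rhs = 2.0 * t.cdf(x, dof) - 1.0          # the simplified form using symmetry

print(lhs)
print(rhs)
```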

Recall that $t_n$ depends on the true mean, $\mu$, which is unknown. But since the relationship between $t_n$ and $\mu$ _is_ known, you can try to rewrite the probability that $t_n$ falls within some range into an equivalent statement about $\mu$. You would then find,

$$\begin{eqnarray}
  \mathrm{Pr}[-x \leq t_n \leq x]
    & = & \mathrm{Pr}\left[-x \leq \dfrac{\bar{y}_n - \mu}{s_n \,/\, \sqrt{n-1}} \leq x\right] \\
    & = & \mathrm{Pr}\left[\bar{y}_n - \dfrac{s_n}{\sqrt{n-1}} x \leq \mu \leq \bar{y}_n + \dfrac{s_n}{\sqrt{n-1}}x \right] \\
    & = & 2 F_n(x) - 1.
\end{eqnarray}$$

In other words, the true mean, $\mu$, falls within $\pm \dfrac{x s_n}{\sqrt{n-1}}$ of $\bar{y}_n$ with some probability that can be computed from the CDF.
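
One way to build confidence in this statement is a coverage experiment: repeat a small study many times and count how often the window around $\bar{y}_n$ actually contains $\mu$. The sketch below assumes normally distributed outputs, for which the $t$-distribution holds exactly; the values of `mu`, `sigma`, `n`, `trials`, and `x` are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(4230)   # assumption: seeded only for repeatability

mu, sigma = 5.0, 2.0     # hypothetical true mean and standard deviation
n = 10                   # runs per experiment
trials = 20000           # number of independent repetitions
x = 1.5                  # arbitrary multiplier of s_n / sqrt(n-1)

# Assumption: normal outputs, so t_n follows Student's t exactly.
Y = rng.normal(mu, sigma, size=(trials, n))
y_bar = Y.mean(axis=1)
s_n = Y.std(axis=1)                 # NumPy's default (ddof=0) matches the 1/n definition
half_width = x * s_n / np.sqrt(n - 1)

# Fraction of repetitions whose window contains the true mean
observed = (np.abs(y_bar - mu) <= half_width).mean()
predicted = 2.0 * t.cdf(x, n - 1) - 1.0

print("observed coverage  =", observed)
print("predicted coverage =", predicted)
```

The two printed values should agree to within sampling noise.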

You can now flip this fact around! That is, you can compute how large a window around $\bar{y}_n$ you would need to ensure that the probability of the true mean falling in that window is, say, $1 - \alpha$, where you choose $\alpha$ based on your personal tolerance for uncertainty. (A typical value is $\alpha=0.1$.) Then,

$$\begin{eqnarray}
  2 F_n(x) - 1 & = & 1 - \alpha \\
  x & = & F_n^{-1}\left(1 - \dfrac{\alpha}{2}\right).
\end{eqnarray}$$

The interval $\bar{y}_n \pm \dfrac{F_n^{-1}\left(1 - \dfrac{\alpha}{2}\right) s_n}{\sqrt{n-1}}$ is known as the _$(1 - \alpha)$ confidence interval_. For instance, choosing $\alpha=0.1$ yields a 90% confidence interval.

To compute this confidence interval in code, you can use [SciPy's `ppf()`](http://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.t.html) function. In particular, to compute $F_n^{-1}(x)$, you would use the following call:

```python
  from scipy.stats import t
  t.ppf (x, n-1)
```

> Observe that we've passed `n-1` rather than `n` into the call to `ppf()`, which is a common statistical convention. (The value `n-1` is referred to as the number of degrees of freedom in the model.)
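
As a small illustration of how `ppf()` inverts the CDF, the sketch below computes the multiplier $x$ for a given $\alpha$ and then confirms that plugging it back into `cdf()` recovers $1 - \alpha$. The values of `n` and `alpha` are hypothetical.

```python
from scipy.stats import t

n = 10          # hypothetical number of runs
alpha = 0.1     # corresponds to a 90% confidence level

x = t.ppf(1.0 - alpha / 2.0, n - 1)        # x = F_n^{-1}(1 - alpha/2)
coverage = 2.0 * t.cdf(x, n - 1) - 1.0     # should recover 1 - alpha

print("x =", x)
print("2 F_n(x) - 1 =", coverage)
```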

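**Exercise 4** (2 points). Try running 10 simulations and compute the sample mean. Then compute a $(1 - \alpha)$ confidence interval around this sample mean. (The demo code below passes $\alpha = 0.1$, which corresponds to a 90% confidence interval.)

> Use the `ppf()` function available in SciPy to invert the CDF.
>
> The example below computes $F_{10}^{-1}(0.95)$, i.e., the inverse CDF with $n - 1 = 9$ degrees of freedom.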

In [None]:
# Example of the inverse CDF

from scipy.stats import t

t.ppf (0.95, 9)

In [None]:
def calc_conf_int (data, alpha):
    """
    Returns the sample mean and half-width of a (1-alpha)
    confidence interval, as a pair (y_bar, dy) for the
    corresponding y_bar +/- dy confidence interval.
    """
    assert type (data) is np.ndarray
    # YOUR CODE HERE
    raise NotImplementedError()

n_e = 10
Y = do_experiments (fake_simulator, n_e)
(y_bar, dy) = calc_conf_int (Y, 0.1)

# Test code
if MU_TRUE < (y_bar-dy) or MU_TRUE > (y_bar + dy):
    err_flag = '**'
else:
    err_flag = ''
print (n_e, "=>", y_bar, "+/-", dy, err_flag)

In [None]:
alt_ppf = t.interval (0.95, 9, scale=np.std (Y) / np.sqrt (n_e-1))
assert alt_ppf[0] <= dy <= alt_ppf[1]

**Exercise 5** (2 points). How many runs (`n_e` above) are needed to get a 95% confidence interval whose half-width is at most 10% of the sample mean, i.e., $\bar{y}_n \pm 0.1\,\bar{y}_n$?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()