<h2> Exploring the Central Limit Theorem </h2>

In this notebook, we're going to explore the central limit theorem. The theorem says that as $N \to \infty$, the appropriately centered and scaled average of $N$ IID random variables (with finite variance) converges in distribution to a normal distribution. A natural question is how large $N$ should be for the central limit theorem to be both accurate and practical: if $N$ is too small, the normal limit is a bad approximation, and if $N$ is too large, we need more data than we can realistically collect.

In Thursday's class, we did an example where we averaged $30$ exponential random variables with distribution $\operatorname{Exp}(1/10)$, and approximated the average as $\operatorname{N}(10, 100/30)$. We can also directly simulate the average of $30$ exponential random variables (we've implemented `Exp` in previous notebooks, and in last week's notebook you took the average of a bunch of random variables!) and compare the two estimates for $P(\overline{X}_{30} > 14)$. The central limit theorem estimate for the probability is about $1.4\%$, while the simulated estimate is around $2.2\%$ (using $10^6$ trials).

Although these numbers are on the same scale (each representing a fairly unlikely outcome that's not too extreme), the simulated value is still nearly $60\%$ larger than the CLT estimate. This is because $30$ is a small number of random variables to average, and the central limit theorem approximation improves as we average more of them.
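For reference, the arithmetic behind these figures can be written out as follows, using the mean $10$ and variance $100$ of $\operatorname{Exp}(1/10)$ and writing $Z$ for a standard normal random variable (the symbol $Z$ is introduced here only for illustration):

$$z = \frac{14 - 10}{\sqrt{100/30}} \approx 2.19, \qquad P(\overline{X}_{30} > 14) \approx P(Z > 2.19) \approx 0.014,$$

$$\frac{0.022 - 0.014}{0.014} \approx 0.57.$$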

<h3> This week's problems </h3>

You'll explore some of these probabilities and see how the estimates improve as $N$ increases; this will give you a feel for when the CLT is actually applicable in practice. 

* Verify the numbers above: the average of $30$ IID $\operatorname{Exp}(1/10)$ random variables should have probability around $0.022$ of being $\ge 14$.

* For $100$ IID $\operatorname{Exp}(1/10)$ random variables, simulate the probability that the average is $\ge 11$ and the probability that it is $\le 9$. Compare these to the results of the central limit theorem; is there better agreement than with $30$ random variables?

* For a sum of $10$ IID $\operatorname{Unif}(-1, 1)$ random variables, simulate the probability that the **sum** is $\ge 1$, $\ge 2$, and $\ge 5$. Compare this to the results of the central limit theorem.

* Repeat the previous part but for $20, 30, 100,$ and $500$ IID $\operatorname{Unif}(-1, 1)$ random variables, and compare to the CLT. For which values of $N$ do you think it's reasonable to estimate with the CLT?
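One possible shape for the simulations in these problems is sketched below. It is only a sketch under stated assumptions: the names `estimate_tail`, `exp_sampler`, and `unif_sampler` are hypothetical, the standard library's `random.expovariate` stands in for the `Exp` function from earlier notebooks, and the default of $20{,}000$ trials is much smaller than the $10^6$ used above, so the estimate will be noisier.

```python
import random

def estimate_tail(sampler, n, threshold, trials=20_000, use_sum=False, seed=None):
    """Monte Carlo estimate of P(statistic >= threshold).

    The statistic is the average of n IID draws from sampler(rng),
    or their sum when use_sum is True.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total = sum(sampler(rng) for _ in range(n))
        stat = total if use_sum else total / n
        if stat >= threshold:
            hits += 1
    return hits / trials

def exp_sampler(rng):
    # Assumption: Exp(1/10) means rate 1/10, i.e. mean 10.
    return rng.expovariate(1 / 10)

def unif_sampler(rng):
    # If R is Unif(0, 1), then 2R - 1 is Unif(-1, 1).
    return 2 * rng.random() - 1

p_hat = estimate_tail(exp_sampler, n=30, threshold=14, seed=0)
print(f"simulated P(average >= 14): {p_hat:.4f}")
```

For the uniform problems, the same function could be called as `estimate_tail(unif_sampler, n=10, threshold=2, use_sum=True)`. A lower-tail probability such as $P(\overline{X} \le 9)$ would need the comparison flipped, which this sketch does not do.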

To help get you started, the following code implements the z-score calculation. You can also look at recent notebooks to find code that implements a running sum of samples and code that implements the uniform distribution. One useful trick to remember is that if $R$ is a $\operatorname{Unif}(0, 1)$ random variable, then $2R - 1$ has a $\operatorname{Unif}(-1, 1)$ distribution.

```python
# Need to take square roots:
from math import sqrt

# You'll also need to generate random numbers:
from random import random

def zScore(mu, var, observed, N):
    # Compute the z-score of an observed average of N IID
    # random variables, each having mean mu and variance var
    return (observed - mu) / sqrt(var / N)

# Example calculation from Thursday's class with
# the exponential distribution:
print(f'z-score: {zScore(10, 100, 14, 30)}')
```

```
z-score: 2.1908902300206643
```
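To turn a z-score into a CLT probability estimate, one option is the standard normal upper tail $P(Z \ge z) = \tfrac{1}{2}\operatorname{erfc}(z/\sqrt{2})$. The sketch below assumes that approach; `upper_tail` is a hypothetical helper name, and `zScore` is repeated only so the block runs on its own. It also shows one way to reuse `zScore` for a **sum**: the event "sum $\ge s$" is the same as "average $\ge s/N$", and each $\operatorname{Unif}(-1, 1)$ variable has mean $0$ and variance $1/3$.

```python
from math import erfc, sqrt

def zScore(mu, var, observed, N):
    # Same formula as the notebook's zScore, repeated for self-containment.
    return (observed - mu) / sqrt(var / N)

def upper_tail(z):
    # P(Z >= z) for a standard normal Z, via the complementary error function.
    return 0.5 * erfc(z / sqrt(2))

# Exponential example: CLT estimate of P(average of 30 >= 14).
z_exp = zScore(10, 100, 14, 30)
print(f"CLT estimate (exponential): {upper_tail(z_exp):.4f}")

# Uniform example: CLT estimate of P(sum of 10 Unif(-1, 1) >= 2).
N, s = 10, 2
z_sum = zScore(0, 1 / 3, s / N, N)
print(f"CLT estimate (uniform sum): {upper_tail(z_sum):.4f}")
```

By symmetry of the normal distribution, a lower-tail estimate such as $P(\overline{X} \le 9)$ can be obtained as `upper_tail(-z)` for the corresponding z-score.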
