<a href="https://colab.research.google.com/github/yardsale8/probability_simulations_in_R/blob/main/lab_2_normal_random_variate_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
library(tidyverse)
library(devtools)
install_github('yardsale8/purrrfect', force = TRUE)
library(purrrfect)

# Lab 2 - Normal Random Variate Generation

In this lab, we will

1. Explore methods for generating variates from the normal distribution, and
2. Practice plotting the estimated distributions.

## Review - Previous Methods

In previous notebooks (most recently the one on plotting estimated distributions), we explored the following approaches for generating random variates.

1. Binomial, geometric, and negative binomial data by sampling with replacement,
2. Hypergeometric data by sampling without replacement,
3. Exponential random variates by transforming $U\sim uniform(0,1)$, and
4. Gamma random variates by summing the transformed uniform variables.

In this lab, we will explore [various approximate and exact methods for generating data from a normal distribution](https://en.wikipedia.org/wiki/Normal_distribution#Computational_methods).

## The Inverse CDF Method.

The first approach for generating normal data is the inverse CDF method, which relies on [the probability integral transformation](https://en.wikipedia.org/wiki/Probability_integral_transform).

**The inverse CDF method.** If $U\sim uniform(0,1)$ and $F$ is the CDF of some continuous distribution, then the transformation $X=F^{-1}(U)$ will generate data from the distribution associated with $F$.
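
As a minimal sketch of the statement above, the following applies the inverse CDF method to an exponential distribution, where $F^{-1}$ has a closed form. The rate, sample size, and seed are illustrative assumptions, not values from this lab.

```r
# Inverse CDF sketch for an exponential distribution.
# rate, n, and the seed are illustrative choices.
set.seed(1)
n    <- 10000
rate <- 2

u <- runif(n)              # U ~ uniform(0, 1)
x <- -log(1 - u) / rate    # F^{-1}(u) for F(x) = 1 - exp(-rate * x)

# The hand-derived inverse should agree with R's built-in quantile function.
max_diff <- max(abs(x - qexp(u, rate = rate)))

c(est_mean = mean(x), expected_mean = 1 / rate, max_diff = max_diff)
```

The same pattern works for any distribution whose quantile function is available in `R` (`qexp`, `qnorm`, and so on).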

### Problem 1 - Generating normal data using the inverse CDF method.

Let $X\sim norm(\mu, \sigma)$. Recall that `qnorm` in `R` is the inverse CDF of the normal distribution; that is, `qnorm(p, mu, sd)` returns the (unique) value of $x$ such that $F(x) = P(X \le x) = p$.

First, we will practice this approach for a specific distribution.


#### Part 1 - Simulating heights of women
Let's use this method to approximate normal data. It has been shown that women in their 20s have heights (in inches) that are approximately $norm(67.5, 2.5)$. Use the inverse CDF method to simulate data from this distribution by

1. Generating a column of $unif(0,1)$ random variates,
2. Mapping the `qnorm` function with the appropriate mean and standard deviation onto this column.
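
One possible sketch of these two steps uses plain `dplyr` verbs (an assumption; the lab may intend helpers from `purrrfect` instead). The column names `u` and `height`, the sample size, and the seed are hypothetical.

```r
library(dplyr)

set.seed(2)
n <- 10000                                   # number of simulated heights (assumed)

heights <- tibble(u = runif(n)) %>%          # step 1: a column of uniform(0, 1) variates
  mutate(height = qnorm(u, mean = 67.5, sd = 2.5))   # step 2: apply the inverse CDF

# Estimated parameters, to compare with the targets 67.5 and 2.5
heights %>%
  summarise(est_mean = mean(height), est_sd = sd(height))
```

Because `qnorm` is vectorised, `mutate` can apply it to the whole column at once.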




In [16]:
# Your code here

#### Part 2 - Estimate distributional parameters
Estimate the mean and SD of the distribution and compare to the desired values.

In [7]:
# Your code here

#### Part 3 - Plot the estimated distribution
1. Plot the estimated density,
2. Plot the empirical vs. theoretical CDFs, and
3. Create a p-p plot.
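
The three plots could be sketched with `ggplot2` roughly as follows. The data frame `df`, its column names, and the plotting position $(i - 0.5)/n$ for the empirical CDF are assumptions made for illustration.

```r
library(ggplot2)

set.seed(3)
x <- sort(qnorm(runif(2000), mean = 67.5, sd = 2.5))
n <- length(x)

df <- data.frame(
  x        = x,
  ecdf     = (seq_len(n) - 0.5) / n,             # empirical CDF at each ordered value
  theo_cdf = pnorm(x, mean = 67.5, sd = 2.5)     # theoretical CDF at the same points
)

# 1. Estimated density, with the target density dashed
p_density <- ggplot(df, aes(x = x)) +
  geom_density() +
  stat_function(fun = dnorm, args = list(mean = 67.5, sd = 2.5), linetype = "dashed")

# 2. Empirical (step) vs. theoretical (dashed) CDF
p_cdf <- ggplot(df, aes(x = x)) +
  geom_step(aes(y = ecdf)) +
  geom_line(aes(y = theo_cdf), linetype = "dashed")

# 3. P-P plot: points near the dashed diagonal indicate good agreement
p_pp <- ggplot(df, aes(x = theo_cdf, y = ecdf)) +
  geom_point(size = 0.5) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed")
```

Printing `p_density`, `p_cdf`, or `p_pp` in a notebook cell renders the corresponding plot.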

In [8]:
# Your code here

### Problem 2 - A 2-parameter simulation
Repeat the approach in the previous problem, but this time generate data over a grid of values for combinations of the mean and standard deviation.

**Hint.** You should be able to adapt your code from the last problem (one of the reasons for solving a simple, specific problem first).
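
A sketch of how the single-distribution code might extend to a grid, assuming `tidyr::expand_grid` for the parameter space. The grid values, the `trial` column, and the other column names are hypothetical.

```r
library(dplyr)
library(tidyr)

set.seed(4)
n_per_cell <- 5000                           # draws per (mu, sigma) pair (assumed)

sim <- expand_grid(mu    = c(60, 67.5, 75),  # illustrative grid values
                   sigma = c(1, 2.5, 5),
                   trial = seq_len(n_per_cell)) %>%
  mutate(u = runif(n()),
         x = qnorm(u, mean = mu, sd = sigma))   # qnorm is vectorised over mean and sd

# One row of estimates per parameter combination
estimates <- sim %>%
  group_by(mu, sigma) %>%
  summarise(est_mean = mean(x), est_sd = sd(x), .groups = "drop")

estimates
```

Each row of `estimates` can be compared directly with the `mu` and `sigma` that generated it.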

#### Part 1 - Generate the data over the parameter space

In [None]:
# Your code here

#### Part 2 - Group by and estimate the distributional parameters

In [14]:
# Your code here

#### Part 3 - Plot a facet grid of estimated distributions.

In [None]:
# Your code here

## The Irwin-Hall approach

An easy and fast approximate method for generating standard normal data is using [the Irwin-Hall distribution](https://en.wikipedia.org/wiki/Irwin%E2%80%93Hall_distribution#Approximating_a_Normal_distribution), which is generated using uniform random variates.

### Problem 3 - Generating approximately standard normal data

#### Part 1 - Generate approximate normal data

To generate data from the Irwin-Hall distribution,

1. Generate vectors of 12 $uniform(0,1)$ random variables,
2. Add the 12 values and subtract 6 (the sum has mean $12 \cdot \tfrac{1}{2} = 6$ and variance $12 \cdot \tfrac{1}{12} = 1$).
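
A base-`R` sketch of these two steps, storing each vector of 12 uniforms as one matrix row. The matrix layout, sample size, and seed are assumptions for illustration.

```r
set.seed(5)
n <- 10000                                        # number of approximate normal variates

u <- matrix(runif(12 * n), nrow = n, ncol = 12)   # each row holds 12 uniform(0, 1) draws
z <- rowSums(u) - 6                               # centre the sum of 12 uniforms at 0

c(est_mean = mean(z), est_sd = sd(z), min = min(z), max = max(z))
```

Note that every value lies in $[-6, 6]$ by construction, so this approximation cannot reproduce the extreme tails of a true normal distribution.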

In [11]:
# Your code here

#### Part 2 - Verify distributional parameters

Estimate the mean and SD and compare to the expected values (0 and 1, respectively).

In [12]:
# Your code here

#### Part 3 - Plot the estimated distribution

1. Plot the estimated density,
2. Plot the empirical vs. theoretical CDFs, and
3. Create a p-p plot.

In [13]:
# Your code here

## The Box-Muller Method

Suppose $U_1$ and $U_2$ are independent samples chosen from the uniform distribution on the unit interval $(0, 1)$. Let $R = \sqrt{-2\ln U_1}$ and $\theta = 2\pi U_2$, and define
$$Z_0 = R \cos(\theta) = \sqrt{-2\ln U_1}\,\cos\left(2\pi U_2\right)$$
and
$$Z_1 = R \sin(\theta) = \sqrt{-2\ln U_1}\,\sin\left(2\pi U_2\right)$$

Then $Z_0$ and $Z_1$ are independent random variables with a standard normal distribution.
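
The transformation can be sketched in base `R` using the standard Box-Muller form, with radius $R = \sqrt{-2\ln U_1}$ and angle $\theta = 2\pi U_2$. The variable names, sample size, and seed are illustrative assumptions.

```r
set.seed(6)
n  <- 10000
u1 <- runif(n)
u2 <- runif(n)

r     <- sqrt(-2 * log(u1))   # radius
theta <- 2 * pi * u2          # angle, uniform on (0, 2 * pi)

z0 <- r * cos(theta)
z1 <- r * sin(theta)

c(mean_z0 = mean(z0), mean_z1 = mean(z1),
  sd_z0   = sd(z0),   sd_z1   = sd(z1),
  cor     = cor(z0, z1))
```

The means and the correlation should be close to 0, and both standard deviations close to 1.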

### Problem 4 - A simulation using the Box-Muller method

Perform a simulation to generate data using the Box-Muller method, then verify the properties of the resulting distribution.

#### Part 1 - Generate the data

1. Generate two columns of $unif(0,1)$ random variates,
2. Perform the transformations described above to generate two columns of standard normal data.



In [9]:
# Your code here

#### Part 2 - Estimate distributional parameters
Estimate the following summary statistics for the distributions of $Z_0$ and $Z_1$.

1. Estimate the means and verify that $E(Z_0) = E(Z_1) \approx 0$
2. Estimate the SD and verify that $SD(Z_0) = SD(Z_1) \approx 1$
3. Estimate the correlation between $Z_0$ and $Z_1$ using the `cor` function to verify independence*.

<font size=1> *usually zero correlation isn't enough to verify independence, but it suffices when the variables are jointly normal.</font>


In [10]:
# Your code here


#### Part 3 - Plotting the distributions.

1. First, create a scatter plot of the two variables and verify the independence*,
2. To create faceted plots of both distributions, you need to stack the $Z_0$ and $Z_1$ columns using `pivot_longer(cols = c(z0, z1))` (where `z0` and `z1` are the names of the columns containing the two normal variates),
3. Plot the estimated densities for each variable,
4. Plot the empirical vs. theoretical CDFs of each variable, and
5. Create p-p plots for each variable.

Note that you will want to facet/group-by the column containing the variable names.

<font size=3> *use `geom_point` with an `aes` mapping each variable to `x` and `y`, respectively.   Independent standard normal random variates should result in a circular cloud of points centered at the origin.</font>
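
A sketch of the scatter plot and the stacking step, using `tidyr::pivot_longer` (the current `tidyr` replacement for `gather`). The column names `variable` and `z`, the sample size, and the seed are hypothetical.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(7)
n <- 5000

sim <- tibble(u1 = runif(n), u2 = runif(n)) %>%
  mutate(z0 = sqrt(-2 * log(u1)) * cos(2 * pi * u2),
         z1 = sqrt(-2 * log(u1)) * sin(2 * pi * u2))

# Scatter plot: independent standard normals give a circular cloud at the origin
p_scatter <- ggplot(sim, aes(x = z0, y = z1)) +
  geom_point(alpha = 0.2) +
  coord_equal()

# Stack z0 and z1 into one value column plus a column of variable names
stacked <- sim %>%
  pivot_longer(cols = c(z0, z1), names_to = "variable", values_to = "z")

# One density panel per variable, with the standard normal density dashed
p_density <- ggplot(stacked, aes(x = z)) +
  geom_density() +
  stat_function(fun = dnorm, linetype = "dashed") +
  facet_wrap(~ variable)
```

The CDF and p-p plots can reuse `stacked` after a `group_by(variable)` step that computes the empirical CDF within each variable.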

In [6]:
# Your code here

## Transforming from Standard to Regular Normal Distributions.

If $Z\sim norm(0,1)$ and $Y = \mu + \sigma Z$, then $Y\sim norm(\mu, \sigma)$.

### Problem 5 - Generating regular normal data.

Finally, we will perform a parametric simulation to verify the transformation.  

#### Part 1 - Generate data over a parameter space.

1. Create a parameter space over a grid of $\mu$ and $\sigma$ values,
2. Generate a column of standard normal random variates,
3. Perform the transformation by mapping the respective values of $\mu$ and $\sigma$ to create a $Y$ column.
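
These steps might look like the following sketch. The grid values and column names are hypothetical, and the standard normal column is produced with the inverse CDF method; any of the generators from this lab could be substituted.

```r
library(dplyr)
library(tidyr)

set.seed(8)
n_per_cell <- 5000                            # draws per (mu, sigma) pair (assumed)

sim <- expand_grid(mu    = c(-2, 0, 3),       # step 1: illustrative parameter grid
                   sigma = c(0.5, 1, 2),
                   trial = seq_len(n_per_cell)) %>%
  mutate(z = qnorm(runif(n())),               # step 2: standard normal variates
         y = mu + sigma * z)                  # step 3: shift and scale

# Estimated mean and SD for each parameter combination
estimates <- sim %>%
  group_by(mu, sigma) %>%
  summarise(est_mean = mean(y), est_sd = sd(y), .groups = "drop")

estimates
```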



#### Part 2 - Estimate distributional parameters.

Group and summarize to estimate the mean and SD of each distribution, then compare to the desired values.

In [None]:
# Your code here

#### Part 3 - Plot the estimated distribution

Create a faceted grid of plots of the estimated distributions.
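
A faceted grid could be sketched as below. To keep the block self-contained it regenerates data in base `R` and uses `rnorm` as a stand-in source of standard normal variates (an assumption; the lab's own generators would normally supply `z`). The grid values are illustrative.

```r
library(ggplot2)

set.seed(9)
n_per_cell <- 2000
grid <- expand.grid(mu = c(-2, 0, 3), sigma = c(0.5, 1, 2))

# Repeat each grid row n_per_cell times, then transform standard normal draws
sim   <- grid[rep(seq_len(nrow(grid)), each = n_per_cell), ]
sim$y <- sim$mu + sim$sigma * rnorm(nrow(sim))   # rnorm is a stand-in for Z

# Rows of panels index mu, columns index sigma
p_grid <- ggplot(sim, aes(x = y)) +
  geom_density() +
  facet_grid(mu ~ sigma, labeller = label_both)
```

Keeping a shared x-axis across panels makes the shift from $\mu$ and the spread from $\sigma$ easy to compare.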

In [15]:
# Your code here