
In [2]:
library(tidyverse)
library(devtools)
install_github('yardsale8/purrrfect', force = TRUE)
library(purrrfect)


# Introduction to Parametric Simulations

In this chapter we have been studying various discrete parametric distributions.  Sometimes we wish to explore the properties of such a distribution over a set of parameter values; in that case, we can use the `tidyverse` and `purrr` toolset to capture the results.

## Outline

1. Setting up the parameter space using `tribble`.
2. Unnesting the parameters and adding trials.
3. Mapping over the parameters using `pmap`.
4. Composing operations to save memory on large simulations.
5. Parametric sampling.

## First Motivating Problem - One Variable Parameter

<font color="aqua">**Our Task.**</font> Suppose we wish to explore the effects of the sample size on the mean and variance of a binomial random variable when $p=0.5$.  

<font color="aqua">**The Problem.** </font> Our current approach would require a separate simulation/pipe for each sample size.

<font color="aqua">**The Solution.**</font> Store the sample sizes in our experimental notebook and use them as mapping inputs.

### Three Approaches

1. Stack the trials into a long table, transform, and then group and aggregate. <font size="1">(optional, covered at the end of the notebook)</font>.
2. Store the trials in a list column, then `map` transformations/aggregations.
3. Compose all actions from **2.** into a single `map`.

## Performing a Parametric Simulation using a List Column

In this variation of the simulation, we will
1. Store all the trials for each parameter (or combination of parameters) in a single row.
2. Use `map` or `pmap` to process those trials.

### Defining the parameter space using a `tribble`

1. Define all the column names on the first row, each preceded by `~`.
2. Define the corresponding collections of values on the second row, in the same order.

In [5]:
parameters <- tribble(~n,
                      c(5,10,15))
parameters

n
<list>
"5, 10, 15"


### The "shape" of a parametric simulation using a list column.

1. Set up the parameter space using a `tribble`.
2. Unnest the parameter(s) of interest.
3. Use `map` or `pmap` to generate a list column of trials.
4. Use `map` or `pmap` to transform/summarize.
5. Drop the outcome column.

In [6]:
# Unnested parameters
num.trials <- 10
p <- 0.5
(parameters
 %>% unnest_longer(n)
 )

n
<dbl>
5
10
15


In [7]:
# Use map to generate trials
p <- 0.5
num.trials <- 10
(parameters
 %>% unnest_longer(n)
 %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, 0.5)),
           )
 )

n,.outcome
<dbl>,<list>
5,"4, 3, 5, 2, 3, 3, 3, 3, 2, 5"
10,"8, 8, 5, 5, 7, 5, 6, 5, 7, 3"
15,"7, 8, 6, 4, 5, 3, 6, 8, 7, 4"


In [8]:
# Transform/summarize
p <- 0.5
num.trials <- 10
(parameters
 %>% unnest_longer(n)
 %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, 0.5)),
            approx.mu = map_dbl(.outcome, mean),
            exact.mu = n*p
           )
 )

n,.outcome,approx.mu,exact.mu
<dbl>,<list>,<dbl>,<dbl>
5,"1, 3, 4, 2, 4, 1, 2, 4, 2, 3",2.6,2.5
10,"4, 5, 6, 9, 7, 4, 7, 6, 7, 3",5.8,5.0
15,"6, 8, 8, 5, 7, 5, 5, 6, 6, 7",6.3,7.5


In [9]:
# Drop outcomes
p <- 0.5
num.trials <- 10
(parameters
 %>% unnest_longer(n)
 %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, 0.5)),
            approx.mu = map_dbl(.outcome, mean),
            exact.mu = n*p
           )
 %>% select(-.outcome)
 )

n,approx.mu,exact.mu
<dbl>,<dbl>,<dbl>
5,2.2,2.5
10,4.8,5.0
15,7.8,7.5


In [10]:
# Good estimate by bumping the num.trials
p <- 0.5
num.trials <- 100000
(parameters
 %>% unnest_longer(n)
 %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, 0.5)),
            approx.mu = map_dbl(.outcome, mean),
            exact.mu = n*p
           )
 %>% select(-.outcome)
 )

n,approx.mu,exact.mu
<dbl>,<dbl>,<dbl>
5,2.49823,2.5
10,4.99673,5.0
15,7.50208,7.5
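
The task statement also mentions the variance, which the cells above do not estimate. The following is a minimal sketch of the same list-column pattern applied to the variance. It assumes the standard formula $np(1-p)$ for the exact binomial variance; the seed and the name `variance.check` are illustrative choices, not part of the original notebook.

```r
library(tidyverse)

set.seed(2025)  # assumed seed, used only to make the sketch reproducible
p <- 0.5
num.trials <- 100000
variance.check <- (tribble(~n, c(5, 10, 15))
  %>% unnest_longer(n)
  %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, p)),
             approx.var = map_dbl(.outcome, var),  # sample variance of the trials
             exact.var = n*p*(1 - p)               # binomial variance formula
            )
  %>% select(-.outcome)
)
variance.check
```

With this many trials, the `approx.var` column should land close to `exact.var` for each sample size.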


### The Advantage to using a list column of outcomes

Storing the trials in a list column has two advantages:
1. All the information is self-contained and visible, and
2. It is easier to verify and debug your code.

In [11]:
# Transform/summarize
p <- 0.5
num.trials <- 10
(parameters
 %>% unnest_longer(n)
 %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, 0.5)), # Easy to
            approx.mu = map_dbl(.outcome, mean),                # verify
            exact.mu = n*p                                      # correctness
           )
 )

n,.outcome,approx.mu,exact.mu
<dbl>,<list>,<dbl>,<dbl>
5,"4, 4, 2, 2, 5, 3, 3, 3, 3, 1",3.0,2.5
10,"3, 5, 3, 6, 4, 9, 5, 6, 3, 7",5.1,5.0
15,"7, 9, 9, 10, 9, 8, 10, 8, 5, 9",8.4,7.5
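
Beyond inspecting the table by eye, the stored outcomes can also be checked programmatically. The sketch below is one possible approach, not the notebook's own method: it adds two hypothetical check columns, `num.draws` and `in.range`, to confirm that each row holds `num.trials` draws and that every draw lies between 0 and `n`.

```r
library(tidyverse)

p <- 0.5
num.trials <- 10
checked <- (tribble(~n, c(5, 10, 15))
  %>% unnest_longer(n)
  %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, p)),
             # hypothetical check: how many draws were stored in each row?
             num.draws = map_int(.outcome, length),
             # hypothetical check: are all draws valid counts for this n?
             in.range = map2_lgl(.outcome, n, \(x, n) all(x >= 0 & x <= n))
            )
)
checked
```

Checks like these are only possible while the `.outcome` column is still present, which is one reason to prototype with it.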


### Two Problems with using a list column of outcomes

Storing outcomes in a list column also has two problems.

#### 1. Displaying more than a few trials is a mess

In [None]:
# Slightly Better estimates
num.trials <- 100
(parameters
 %>% unnest_longer(n)
 %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, 0.5)),
            approx.mu = map_dbl(.outcome, mean),
            exact.mu = n*p
           )
 )

n,.outcome,approx.mu,exact.mu
<dbl>,<list>,<dbl>,<dbl>
5,"2, 3, 3, 1, 5, 2, 4, 3, 2, 4, 2, 3, 3, 3, 3, 3, 4, 4, 4, 2, 3, 2, 3, 0, 1, 4, 3, 1, 2, 4, 4, 3, 2, 4, 1, 0, 3, 3, 2, 1, 3, 1, 2, 3, 4, 2, 2, 3, 4, 1, 3, 2, 3, 2, 3, 3, 1, 2, 2, 2, 2, 1, 2, 3, 4, 4, 3, 3, 1, 2, 4, 4, 0, 1, 3, 2, 4, 5, 3, 2, 1, 2, 3, 3, 1, 4, 2, 2, 2, 2, 3, 3, 0, 0, 1, 4, 2, 4, 2, 2",2.5,2.5
10,"6, 4, 5, 5, 4, 8, 6, 5, 6, 4, 5, 5, 3, 8, 6, 6, 4, 7, 4, 6, 7, 6, 4, 7, 4, 4, 6, 7, 3, 4, 5, 7, 4, 2, 3, 4, 6, 5, 5, 6, 7, 3, 4, 5, 4, 6, 6, 6, 6, 4, 4, 4, 8, 6, 7, 7, 5, 9, 6, 4, 7, 7, 3, 3, 5, 4, 3, 4, 7, 6, 3, 5, 3, 5, 4, 8, 5, 6, 5, 4, 3, 6, 4, 6, 2, 6, 3, 8, 4, 4, 6, 3, 3, 6, 6, 6, 6, 5, 5, 4",5.1,5.0
15,"8, 6, 6, 10, 6, 8, 7, 8, 8, 8, 9, 10, 8, 8, 6, 9, 7, 5, 8, 8, 5, 7, 7, 10, 9, 7, 6, 10, 7, 7, 6, 12, 7, 9, 6, 4, 4, 7, 4, 4, 6, 8, 4, 7, 6, 3, 6, 8, 8, 8, 5, 9, 10, 9, 8, 7, 9, 6, 9, 9, 7, 5, 5, 6, 6, 4, 4, 9, 8, 6, 6, 8, 7, 4, 6, 8, 7, 9, 6, 8, 5, 9, 6, 5, 8, 7, 7, 4, 4, 7, 7, 6, 8, 10, 3, 8, 6, 7, 6, 8",6.96,7.5


#### 2. We are storing a *lot* of data during most intermediate steps.

In [None]:
# Drop outcomes
num.trials <- 100000
(parameters
 %>% unnest_longer(n)
 %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, 0.5)), # <== Lots of data/memory used
            approx.mu = map_dbl(.outcome, mean),
            exact.mu = n*p
           )
 %>% select(-.outcome)
 )

n,approx.mu,exact.mu
<dbl>,<dbl>,<dbl>
5,2.49685,2.5
10,5.0072,5.0
15,7.49682,7.5
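
To get a rough sense of how much memory the outcome column takes, one option is base R's `object.size()`. The sketch below is illustrative only; the names `with.outcomes` and `without.outcomes` are assumptions introduced here, and the exact sizes will vary by platform.

```r
library(tidyverse)

p <- 0.5
num.trials <- 100000
with.outcomes <- (tribble(~n, c(5, 10, 15))
  %>% unnest_longer(n)
  %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, p)),
             approx.mu = map_dbl(.outcome, mean)
            )
)
without.outcomes <- select(with.outcomes, -.outcome)

# object.size() gives an estimate of the memory used by each table
format(object.size(with.outcomes), units = "Kb")
format(object.size(without.outcomes), units = "Kb")
```

The table that keeps `.outcome` holds three vectors of 100,000 integers, so it should be far larger than the summary table.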


### The Solution

The solution is to

1. Prototype the code using a separate list column of outcomes, then
2. Compose the steps into a single function.

## The "shape" of a composed parametric simulation

1. Set up the parameter space using a `tribble`.
2. Unnest the parameter(s) of interest.
3. Use one `map` or `pmap` to perform the entire process, usually by piping each intermediate result into the next function.

In [12]:
# First prototype WITH the outcome column
p <- 0.5
num.trials <- 10
(parameters
 %>% unnest_longer(n)
 %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, 0.5)), # Easy to
            approx.mu = map_dbl(.outcome, mean),                # verify
            exact.mu = n*p                                      # correctness
           )
 )

n,.outcome,approx.mu,exact.mu
<dbl>,<list>,<dbl>,<dbl>
5,"4, 3, 3, 3, 2, 1, 2, 2, 2, 0",2.2,2.5
10,"8, 8, 5, 4, 5, 3, 5, 4, 5, 7",5.4,5.0
15,"8, 8, 10, 10, 7, 7, 8, 5, 11, 9",8.3,7.5


In [13]:
# Second ... Compose!
num.trials <- 100
(parameters
 %>% unnest_longer(n)
 %>% mutate(approx.mu = map_dbl(n, \(n) (rbinom(num.trials, n, 0.5)
                                        %>% mean)),
            exact.mu = n*p
           )
)

n,approx.mu,exact.mu
<dbl>,<dbl>,<dbl>
5,2.41,2.5
10,5.06,5.0
15,7.46,7.5


### When can you compose?

You can compose whenever an intermediate result is saved only so that it can be used immediately as the input to the next step, and is not needed anywhere else.
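
As a small check of this idea, the sketch below runs the two-step and the composed versions from the same random seed and compares the results. The seed and the names `sizes`, `two.step`, and `composed` are assumptions made for illustration.

```r
library(tidyverse)

num.trials <- 100
sizes <- c(5, 10, 15)

# Two steps: save the raw outcomes, then summarize them
set.seed(42)  # assumed seed so both versions see the same random draws
outcomes <- map(sizes, \(n) rbinom(num.trials, n, 0.5))
two.step <- map_dbl(outcomes, mean)

# One step: pipe the raw outcomes straight into the summary
set.seed(42)
composed <- map_dbl(sizes, \(n) (rbinom(num.trials, n, 0.5) %>% mean))

identical(two.step, composed)
```

Because both versions draw the same random numbers in the same order, the comparison should return `TRUE`; the only difference is whether the raw outcomes are kept.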

#### Tracking the data

Here the raw trials are saved as `.outcome`, which is then used as input on the next mutate.

In [None]:
num.trials <- 10
(parameters
 %>% unnest_longer(n) # ────────┐
 %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, 0.5)),
            # │  └─────────────────────┘
            # └────────────────────┐
            approx.mu = map_dbl(.outcome, mean),
            #   │                  └──────┘ │
            #   └───────────────────────────┘
            exact.mu = n*p
           )
 %>% select(-.outcome)
 )

#### Tracking the data (composed)

Instead of saving the intermediate result (i.e., the raw output) we can simply pipe it into the next action.

In [None]:
# Compose!
num.trials <- 100
(parameters
 %>% unnest_longer(n) # ─────────────┐
 %>% mutate(approx.mu = map_dbl(n, \(n) (rbinom(num.trials, n, 0.5)
          #   │                              │
                                        %>% mean)),
          #   └──────────────────────────────┘
            exact.mu = n*p
           )
)

### <font color="red"> Exercise 2.5.1 </font>

Suppose that we want to compare the cutoff for the top 5% of binomial data with $n = 100$ and $p\in\{0.25, 0.5, 0.75\}$.

**Tasks.**
1. Prototype the process by storing the raw outcomes in a list column, then processing with additional `map`s,
2. Verify the correctness of your code, and
3. Compose all the intermediate steps to eliminate the need to store the raw outcomes.

In [None]:
# Your code here.

## Parametric Simulations with more than one parameter

A similar process can be used to simulate a scenario where two or more parameters vary across the experiment.  Depending on the scenario, we will
1. **Two parameters.** Use either `map2` or `pmap` to generate the raw outcomes.
2. **Three+ parameters.** Use `pmap` to generate the raw outcomes.
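
The examples in this notebook vary at most two parameters. As a sketch of the three-parameter case, the block below uses the hypergeometric distribution as a stand-in, since `rhyper` takes three parameters (`m` white balls, `n` black balls, `k` draws). The parameter values, the seed, and the name `hyper.check` are assumptions chosen for illustration; the exact mean uses the standard formula $km/(m+n)$.

```r
library(tidyverse)

set.seed(7)  # assumed seed, used only to make the sketch reproducible
num.trials <- 100000
three.parameters <-
  tribble(~m,       ~n,        ~k,
          c(5, 10), c(10, 20), c(3, 5))

hyper.check <- (three.parameters
  %>% unnest_longer(m)
  %>% unnest_longer(n)
  %>% unnest_longer(k)
  # pmap_dbl passes one value from each of the three columns per call
  %>% mutate(approx.mu = pmap_dbl(list(m, n, k),
                                  \(m, n, k) mean(rhyper(num.trials, m, n, k))),
             exact.mu = k*m/(m + n)
            )
)
hyper.check
```

With three columns to map over, `map2` no longer applies, so `pmap` (here `pmap_dbl`) is the natural choice.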

### Example 2 - Investigate the mean part 2

Now suppose we want to investigate the mean of the binomial distribution for all combinations of $n\in\{5,10,15\}$ and $p\in\{0.25, 0.5, 0.75\}$.  

#### Step 1.  Define the parameter space

In [14]:
# Define the parameter space
two.parameters <-
  tribble(~n,         ~p,
          c(5,10,15), c(0.25, 0.5, 0.75))
two.parameters

n,p
<list>,<list>
"5, 10, 15","0.25, 0.50, 0.75"


#### Step 2. Unnest both parameters

In [15]:
# Unnest both parameters
(two.parameters
 %>% unnest_longer(n)
 %>% unnest_longer(p)
 )

n,p
<dbl>,<dbl>
5,0.25
5,0.5
5,0.75
10,0.25
10,0.5
10,0.75
15,0.25
15,0.5
15,0.75
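
Unnesting one list column after the other yields every combination of the two parameters. As a cross-check, the sketch below compares the result with `tidyr::expand_grid`, which builds the same grid directly; the names `unnested` and `grid` are introduced here for illustration only.

```r
library(tidyverse)

two.parameters <-
  tribble(~n,           ~p,
          c(5, 10, 15), c(0.25, 0.5, 0.75))

unnested <- (two.parameters
  %>% unnest_longer(n)
  %>% unnest_longer(p)
)

# expand_grid builds all combinations, varying the first column slowest
grid <- expand_grid(n = c(5, 10, 15), p = c(0.25, 0.5, 0.75))

nrow(unnested) == nrow(grid)
all(unnested$n == grid$n)
all(unnested$p == grid$p)
```

Either route gives 3 x 3 = 9 rows; the `tribble` plus `unnest_longer` form keeps the parameter space visible as a single row before it is expanded.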


#### Step 3.  Generate outcomes with `map2`

In [None]:
# Version 1 - Use map2 to generate the data
num.trials <- 10
(two.parameters
 %>% unnest_longer(n)
 %>% unnest_longer(p)
 %>% mutate(.outcome = map2(n, p, \(n, p) rbinom(num.trials, n, p)))
 )

n,p,.outcome
<dbl>,<dbl>,<list>
5,0.25,"2, 0, 1, 1, 0, 1, 2, 1, 1, 0"
5,0.5,"2, 4, 3, 4, 2, 3, 3, 1, 4, 2"
5,0.75,"4, 4, 3, 3, 3, 3, 3, 5, 4, 4"
10,0.25,"1, 1, 4, 3, 3, 4, 2, 3, 3, 4"
10,0.5,"5, 6, 6, 4, 4, 7, 4, 5, 5, 7"
10,0.75,"7, 8, 8, 6, 8, 9, 8, 8, 5, 8"
15,0.25,"3, 4, 5, 2, 4, 7, 5, 4, 6, 2"
15,0.5,"8, 10, 9, 9, 10, 10, 7, 7, 8, 4"
15,0.75,"13, 10, 10, 11, 13, 10, 10, 9, 11, 11"


#### Step 3 (alternative). Generate outcomes with `pmap`.

In [16]:
# Version 2 - Use pmap to generate the data
num.trials <- 10
(two.parameters
 %>% unnest_longer(n)
 %>% unnest_longer(p)
 %>% mutate(.outcome = pmap(list(size = n, prob = p), rbinom, n = num.trials))
 )

n,p,.outcome
<dbl>,<dbl>,<list>
5,0.25,"1, 1, 0, 0, 1, 3, 1, 1, 1, 1"
5,0.5,"2, 3, 4, 4, 5, 2, 2, 2, 3, 2"
5,0.75,"4, 2, 4, 3, 2, 4, 3, 3, 3, 4"
10,0.25,"3, 2, 2, 6, 1, 5, 1, 3, 1, 3"
10,0.5,"7, 5, 7, 8, 3, 2, 2, 3, 5, 5"
10,0.75,"6, 5, 7, 8, 10, 6, 8, 9, 8, 4"
15,0.25,"2, 3, 2, 4, 4, 6, 3, 6, 5, 3"
15,0.5,"9, 6, 3, 5, 7, 6, 9, 4, 9, 10"
15,0.75,"14, 10, 13, 11, 7, 12, 11, 13, 14, 12"
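
In the `pmap` version, the names in the list (`size`, `prob`) are matched to the arguments of `rbinom`, and the extra `n = num.trials` is passed along to every call. The sketch below, which uses an assumed seed and the illustrative names `sizes`, `probs`, `via.map2`, and `via.pmap`, checks that the two versions generate the same outcomes when started from the same seed.

```r
library(tidyverse)

num.trials <- 10
sizes <- c(5, 10, 15)
probs <- c(0.25, 0.5, 0.75)

set.seed(11)  # assumed seed so both versions see the same random draws
via.map2 <- map2(sizes, probs, \(n, p) rbinom(num.trials, n, p))

set.seed(11)
# list names are matched to rbinom's size and prob arguments;
# n = num.trials is forwarded unchanged to each call
via.pmap <- pmap(list(size = sizes, prob = probs), rbinom, n = num.trials)

identical(via.map2, via.pmap)
```

The comparison should return `TRUE`. Writing the `pmap` call with an explicit lambda, such as `\(size, prob) rbinom(num.trials, size, prob)`, is an equivalent alternative.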


#### Step 4. Process and summarize.

In [17]:
num.trials <- 10
(two.parameters
 %>% unnest_longer(n)
 %>% unnest_longer(p)
 %>% mutate(.outcome = map2(n, p, \(n, p) rbinom(num.trials, n, p)))
 %>% mutate(approx.mu = map_dbl(.outcome, mean),
            exact.mu = n*p)
 )

n,p,.outcome,approx.mu,exact.mu
<dbl>,<dbl>,<list>,<dbl>,<dbl>
5,0.25,"1, 2, 1, 1, 0, 3, 4, 1, 1, 2",1.6,1.25
5,0.5,"3, 1, 3, 3, 2, 3, 1, 3, 4, 5",2.8,2.5
5,0.75,"5, 3, 4, 4, 4, 5, 5, 4, 3, 4",4.1,3.75
10,0.25,"3, 2, 2, 3, 5, 4, 2, 0, 4, 3",2.8,2.5
10,0.5,"4, 8, 5, 4, 6, 3, 7, 6, 3, 3",4.9,5.0
10,0.75,"9, 9, 8, 8, 5, 7, 10, 9, 7, 7",7.9,7.5
15,0.25,"1, 4, 6, 6, 2, 4, 0, 2, 5, 6",3.6,3.75
15,0.5,"7, 7, 12, 9, 4, 9, 9, 7, 6, 7",7.7,7.5
15,0.75,"10, 9, 12, 14, 10, 12, 10, 8, 11, 11",10.7,11.25


#### Step 5. Drop outcomes and bump the number of trials.

In [18]:
num.trials <- 100000
(two.parameters
 %>% unnest_longer(n)
 %>% unnest_longer(p)
 %>% mutate(.outcome = map2(n, p, \(n, p) rbinom(num.trials, n, p)))
 %>% mutate(approx.mu = map_dbl(.outcome, mean),
            exact.mu = n*p)
 %>% select(-.outcome)
 )

n,p,approx.mu,exact.mu
<dbl>,<dbl>,<dbl>,<dbl>
5,0.25,1.24692,1.25
5,0.5,2.50127,2.5
5,0.75,3.74703,3.75
10,0.25,2.50215,2.5
10,0.5,5.00122,5.0
10,0.75,7.50624,7.5
15,0.25,3.74455,3.75
15,0.5,7.49902,7.5
15,0.75,11.25149,11.25


#### Step 6. Compose into one `map`.

In [19]:
num.trials <- 100000
(two.parameters
 %>% unnest_longer(n)
 %>% unnest_longer(p)
#  %>% mutate(.outcome = map2(n, p, \(n, p) rbinom(num.trials, n, p))
#  %>% mutate(approx.mu = map_dbl(.outcome, mean),
            # exact.mu = n*p)
 %>% mutate(approx.mu = (map2(n, p, \(n, p) rbinom(num.trials, n, p))
                        %>% map_dbl(mean)),
            exact.mu = n*p)
#  %>% select(-.outcome)
 )

n,p,approx.mu,exact.mu
<dbl>,<dbl>,<dbl>,<dbl>
5,0.25,1.25433,1.25
5,0.5,2.50147,2.5
5,0.75,3.74747,3.75
10,0.25,2.49352,2.5
10,0.5,5.00414,5.0
10,0.75,7.5039,7.5
15,0.25,3.74011,3.75
15,0.5,7.50689,7.5
15,0.75,11.24131,11.25


### <font color="red"> Exercise 2.5.2 </font>

Suppose that we want to compare the cutoff for the top 5% of binomial data with $n\in\{25, 50, 100\}$ and $p\in\{0.25, 0.5, 0.75\}$.

**Tasks.**
1. Prototype the process by storing the raw outcomes in a list column, then processing with additional `pmap`/`map2`s,
2. Verify the correctness of your code, and
3. Compose all the intermediate steps to eliminate the need to store the raw outcomes.

In [None]:
# Your code here

## (Optional) Stacked Parametric Simulations

The other option, not covered above, is to stack the outcomes in a simple, atomic column.  The advantage of this approach is simpler mutates (you rarely need to `map`), at the cost of spreading related trials across multiple rows, which requires an extra grouped aggregation at the end.

### The "shape" of a stacked parametric simulation.

1. Set up the parameter space using a `tribble`.
2. Unnest the parameter(s) of interest.
3. Add trials to convert the parameter space into an experimental notebook.
4. Use `map` or `pmap` to generate outcomes using the parameter column(s) as input.
5. Proceed as usual (estimate probabilities, expected values, etc.).

### Setting up the parameter space

The `tribble` function can create the nested parameter space, with

1. The names of the column, starting with `~`, on the first line, and
2. Vectors or lists of parameters on the second line.

Note that you need to line up the names and parameter vectors, respectively (first with first, second with second, etc.). The output is a `tibble` with one row, containing a list column for each parameter.



In [None]:
parameters <- tribble(~n,
                      seq(5,15,5))
parameters

n
<list>
"5, 10, 15"


### Unnesting the parameters.

Next, we need to spread the parameters over multiple rows, with one parameter value per row.  This is accomplished using `unnest_longer`.

In [None]:
(parameters
 %>% unnest_longer(n)
 )

n
<dbl>
5
10
15


### Adding trials

Next, we need to replicate each parameter value over a number of trials, which is accomplished using `add_trials`.

In [None]:
num.trials <- 5
(parameters
 %>% unnest_longer(n)
 %>% add_trials(num.trials)
 )

n,.trial
<dbl>,<dbl>
5,1
5,2
5,3
5,4
5,5
10,1
10,2
10,3
10,4
10,5


### Generating outcomes with `map`

To use map, we need to

1. Map onto the parameter column, and
2. Write a function that takes this parameter as input.

As always, select the version of `map` that results in an atomic column type.  In this case, we will use `map_int`, as `rbinom` generates integers.

In [None]:
num.trials <- 5
p <- 0.5
(parameters
 %>% unnest_longer(n)
 %>% add_trials(num.trials)
 %>% mutate(num.successes = map_int(n, \(n) rbinom(1, n, p)))
 )

n,.trial,num.successes
<dbl>,<dbl>,<int>
5,1,3
5,2,3
5,3,3
5,4,4
5,5,3
10,1,7
10,2,6
10,3,6
10,4,5
10,5,2
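
The choice of `map_int` above rests on `rbinom` returning integers. A quick way to confirm this is to inspect the type of a draw with `typeof`; the names `draw` and `counts` below are introduced only for this sketch.

```r
library(tidyverse)

# A single binomial draw is stored as an integer, not a double
draw <- rbinom(1, 10, 0.5)
typeof(draw)

# map_int therefore yields an atomic integer vector, one draw per size
counts <- map_int(c(5, 10, 15), \(n) rbinom(1, n, 0.5))
typeof(counts)
```

If the mapped function returned non-integer values instead, `map_dbl` would be the matching choice.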


### Continue as usual

Finally, we can complete the task, which involves grouping and aggregating to get estimates of the long-run mean for each sample size.

In [None]:
# Test on small number of trials
num.trials <- 5
p <- 0.5
(parameters
 %>% unnest_longer(n)
 %>% add_trials(num.trials)
 %>% mutate(num.successes = map_int(n, \(n) rbinom(1, n, p)))
 %>% group_by(n)
 %>% summarise(approx.mu = mean(num.successes))
 %>% mutate(exact.mu = n*p)
 )

n,approx.mu,exact.mu
<dbl>,<dbl>,<dbl>
5,3.0,2.5
10,5.0,5.0
15,9.6,7.5


In [None]:
# Better estimate by bumping the number of trials
num.trials <- 10000
p <- 0.5
(parameters
 %>% unnest_longer(n)
 %>% add_trials(num.trials)
 %>% mutate(num.successes = map_int(n, \(n) rbinom(1, n, p)))
 %>% group_by(n)
 %>% summarise(approx.mu = mean(num.successes))
 %>% mutate(exact.mu = n*p)
 )

n,approx.mu,exact.mu
<dbl>,<dbl>,<dbl>
5,2.5037,2.5
10,4.9891,5.0
15,7.5205,7.5
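
Because `rbinom` is vectorized over its `size` argument, the stacked layout can often skip `map_int` entirely. The sketch below illustrates this under a few assumptions: it uses `tidyr::expand_grid` as a stand-in for the `unnest_longer` plus `add_trials` steps so that it runs without `purrrfect`, and the seed and the name `stacked` are illustrative.

```r
library(tidyverse)

set.seed(3)  # assumed seed, used only to make the sketch reproducible
num.trials <- 10000
p <- 0.5
stacked <- (expand_grid(n = c(5, 10, 15), .trial = 1:num.trials)  # stand-in for add_trials
  # one draw per row; rbinom recycles over the size column directly
  %>% mutate(num.successes = rbinom(length(n), size = n, prob = p))
  %>% group_by(n)
  %>% summarise(approx.mu = mean(num.successes))
  %>% mutate(exact.mu = n*p)
)
stacked
```

This version draws one value per row in a single vectorized call, which is the kind of simpler `mutate` the stacked layout allows.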
