<a href="https://colab.research.google.com/github/yardsale8/probability_simulations_in_R/blob/main/2_5_introduction_to_parametric_simulations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
library(tidyverse)
library(devtools)
install_github('yardsale8/purrrfect', force = TRUE)
library(purrrfect)

# Introduction to Parametric Simulations

In this chapter we have been studying various discrete parametric distributions.  Sometimes we wish to explore how the properties of such a distribution change over a set of parameter values; in this case, we can use the `tidyverse` and `purrr` toolset to capture the results.

## Outline

1. Setting up the parameter space using `parameters`.
2. Unnesting the parameters and adding trials.
3. Mapping over the parameters using `map`, `map2`, or `pmap`.
4. Composing operations to save memory on large simulations.
5. Parametric sampling.

## First Motivating Problem - One Variable Parameter

<font color="aqua">**Our Task.** </font> Suppose we wish to explore the effects of the sample size on the mean and variance of a binomial random variable when $p=0.5$.  

<font color="aqua">**The Problem.** </font> Our current approach would require a separate simulation/pipe for each sample size.

<font color="aqua">**The Solution.**</font> Store the sample sizes in our experimental notebook and use them as mapping inputs.

### Three Approaches

1. Stack the trials into a long table, transform, and then group and aggregate <font size="1">(optional, covered at the end of the notebook)</font>.
2. Store the trials in a list column, then `map` transformations/aggregations.
3. Compose all actions from **2.** into a single `map`.

## Performing a Parametric Simulation using a List Column

In this variation of the simulation, we will
1. Store all the trials for each parameter (or combination of parameters) in a single row.
2. Use `map` or `pmap` to process those trials.

### Defining the parameter space using `parameters`

1. List the parameter names on the first row, each preceded by `~`.
2. List the collection of values for each parameter on the second row, in the same order as the names.

In [2]:
parameters(~n,
         c(5,10,15))

n
<dbl>
5
10
15


### The "shape" of a parametric simulation using a list column.

1. Set up the parameter space using `parameters`.
2. Use `map` or `pmap` to generate a list column of trials.
3. Use `map` or `pmap` to transform/summarize.
4. Drop the outcome column.

In [3]:
# Set up parameter space
(parameters(~n,
          c(5,10,15))
 )

n
<dbl>
5
10
15


In [4]:
# Use map to generate trials
p <- 0.5
num.trials <- 10
(parameters(~n,
         c(5,10,15))
 %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, p)),
           )
 )

n,.outcome
<dbl>,<list>
5,"3, 2, 1, 2, 4, 3, 2, 3, 3, 3"
10,"5, 5, 6, 5, 6, 5, 5, 5, 2, 6"
15,"7, 8, 8, 10, 7, 8, 9, 7, 10, 8"


In [5]:
# Transform/summarize
p <- 0.5
num.trials <- 10
(parameters(~n,
         c(5,10,15))
 %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, p)),
            approx.mu = map_dbl(.outcome, mean),
            exact.mu = n*p
           )
 )

n,.outcome,approx.mu,exact.mu
<dbl>,<list>,<dbl>,<dbl>
5,"2, 3, 4, 3, 1, 3, 1, 0, 2, 2",2.1,2.5
10,"3, 4, 8, 5, 7, 7, 4, 5, 3, 5",5.1,5.0
15,"9, 6, 6, 9, 9, 6, 7, 9, 12, 6",7.9,7.5


In [6]:
# Drop outcomes
p <- 0.5
num.trials <- 10
(parameters(~n,
            c(5,10,15))
 %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, p)),
            approx.mu = map_dbl(.outcome, mean),
            exact.mu = n*p
           )
 %>% select(-.outcome)
 )

n,approx.mu,exact.mu
<dbl>,<dbl>,<dbl>
5,2.7,2.5
10,4.9,5.0
15,8.1,7.5


In [7]:
# Better estimates from a much larger num.trials
p <- 0.5
num.trials <- 100000
(parameters(~n,
          c(5,10,15))
 %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, p)),
            approx.mu = map_dbl(.outcome, mean),
            exact.mu = n*p
           )
 %>% select(-.outcome)
 )

n,approx.mu,exact.mu
<dbl>,<dbl>,<dbl>
5,2.49454,2.5
10,4.99903,5.0
15,7.51302,7.5
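
The effect of `num.trials` can also be explored as a parameter in its own right. The sketch below is one possible way to do that, under two assumptions that are not part of the notebook: `tidyr::expand_grid` stands in for `parameters`, and `map2_dbl` (the two-input version of `map_dbl`, covered later with `map2`) is used to map over both columns. The `abs.error` column name and the seed are illustrative choices.

```r
library(dplyr)
library(purrr)
library(tidyr)

set.seed(1)  # arbitrary seed, only so the run is repeatable
p <- 0.5

# expand_grid() crosses every n with every num.trials value.
# It is used here as a stand-in for parameters().
convergence <- expand_grid(n = c(5, 10, 15),
                           num.trials = c(10, 1000, 100000)) %>%
  mutate(approx.mu = map2_dbl(n, num.trials,
                              \(n, num.trials) mean(rbinom(num.trials, n, p))),
         exact.mu  = n * p,
         abs.error = abs(approx.mu - exact.mu))

convergence
```

Reading down the `abs.error` column within each value of `n` should show the estimate tightening as `num.trials` grows, although any single run can be unlucky for the smaller trial counts.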


### The Advantages of using a list column of outcomes

Storing the trials in a list column has two advantages:
1. All the information is self-contained and visible, and
2. It is easier to verify and debug your code.

In [8]:
# Transform/summarize
p <- 0.5
num.trials <- 10
(parameters(~n,
            c(5,10,15))
 %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, p)),   # Easy to
            approx.mu = map_dbl(.outcome, mean),                # verify
            exact.mu = n*p                                      # correctness
           )
 )

n,.outcome,approx.mu,exact.mu
<dbl>,<list>,<dbl>,<dbl>
5,"3, 0, 4, 3, 3, 3, 1, 3, 5, 2",2.7,2.5
10,"4, 5, 3, 5, 3, 3, 6, 6, 4, 5",4.4,5.0
15,"7, 4, 8, 3, 6, 9, 7, 8, 7, 9",6.8,7.5
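
To make "easy to verify" concrete, the sketch below pulls the stored trials back out of the list column and checks them by hand. It assumes a plain `tibble` in place of `parameters`, and the names `prototype`, `checks`, `right.length`, and `in.range` are hypothetical, chosen only for this illustration.

```r
library(dplyr)
library(purrr)

set.seed(2)  # arbitrary seed, only so the run is repeatable
p <- 0.5
num.trials <- 10

prototype <- tibble(n = c(5, 10, 15)) %>%
  mutate(.outcome  = map(n, \(n) rbinom(num.trials, n, p)),
         approx.mu = map_dbl(.outcome, mean))

# Pull the stored trials for the first row and recompute its summary by hand.
first.row.trials <- prototype$.outcome[[1]]
first.row.trials
mean(first.row.trials) == prototype$approx.mu[1]

# Sanity checks on the raw trials: the right count, and every value in 0..n.
checks <- prototype %>%
  mutate(right.length = map_int(.outcome, length) == num.trials,
         in.range     = map2_lgl(.outcome, n, \(x, n) all(x >= 0 & x <= n)))

checks %>% select(n, right.length, in.range)
```

Checks like these are only possible while the `.outcome` column is still in the table, which is the main argument for prototyping with it.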


### Two Problems with using a list column of outcomes

Storing outcomes in a list column has two problems.

#### 1. Displaying more than a few trials is a mess

In [9]:
# Slightly better estimates
num.trials <- 100
(parameters(~n,
            c(5,10,15))
 %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, 0.5)), # Yuck! So long!
            approx.mu = map_dbl(.outcome, mean),
            exact.mu = n*p
           )
 )

n,.outcome,approx.mu,exact.mu
<dbl>,<list>,<dbl>,<dbl>
5,"3, 3, 5, 3, 2, 3, 0, 2, 0, 4, 2, 1, 2, 4, 4, 3, 3, 3, 1, 3, 4, 2, 4, 3, 1, 2, 2, 3, 1, 3, 1, 1, 2, 2, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 3, 1, 3, 3, 1, 4, 3, 3, 3, 1, 3, 5, 4, 2, 1, 2, 2, 2, 5, 5, 3, 5, 0, 3, 1, 3, 2, 3, 2, 0, 3, 4, 1, 3, 2, 4, 2, 3, 4, 3, 2, 1, 2, 4, 4, 0, 3, 3, 3, 2, 1, 3, 3, 1, 1, 3",2.47,2.5
10,"5, 5, 6, 4, 5, 7, 7, 5, 5, 5, 5, 7, 5, 5, 1, 4, 7, 2, 6, 3, 5, 5, 4, 7, 5, 4, 1, 4, 3, 4, 4, 5, 5, 4, 5, 4, 5, 3, 6, 4, 1, 4, 7, 3, 5, 4, 4, 4, 5, 4, 5, 6, 5, 5, 4, 5, 4, 5, 4, 5, 6, 5, 3, 6, 8, 2, 8, 6, 6, 5, 8, 6, 3, 5, 5, 5, 7, 7, 5, 5, 4, 6, 4, 7, 5, 7, 5, 5, 3, 5, 4, 6, 4, 9, 4, 4, 6, 0, 6, 5",4.85,5.0
15,"5, 7, 7, 9, 9, 11, 6, 8, 9, 9, 7, 8, 7, 6, 8, 6, 5, 8, 11, 8, 7, 8, 6, 6, 5, 6, 6, 13, 7, 4, 7, 7, 6, 11, 10, 6, 6, 7, 10, 10, 9, 8, 8, 4, 8, 6, 7, 7, 8, 7, 5, 7, 6, 9, 9, 8, 9, 9, 4, 7, 5, 10, 7, 9, 5, 8, 8, 8, 5, 6, 10, 3, 9, 10, 9, 9, 6, 11, 7, 8, 7, 8, 8, 6, 5, 10, 11, 8, 6, 5, 5, 3, 5, 6, 11, 9, 8, 5, 11, 9",7.46,7.5


#### 2. We are storing a *lot* of data during most intermediate steps.

In [10]:
# Drop outcomes
num.trials <- 100000 # Start adding zeros and see what happens
(parameters(~n,
            c(5,10,15))
 %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, 0.5)), # <== Lots of data/memory used
            approx.mu = map_dbl(.outcome, mean),
            exact.mu = n*p
           )
 %>% select(-.outcome)
 )

n,approx.mu,exact.mu
<dbl>,<dbl>,<dbl>
5,2.50094,2.5
10,5.00655,5.0
15,7.50234,7.5
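
A rough way to see the memory cost is to measure the table before and after dropping `.outcome`. The sketch below assumes a plain `tibble` in place of `parameters` and uses base R's `object.size`, which reports an estimate rather than an exact figure. Since `rbinom` returns integers here (4 bytes each), three rows of 100000 trials should come to roughly 1.2 MB.

```r
library(dplyr)
library(purrr)

p <- 0.5
num.trials <- 100000

with.outcomes <- tibble(n = c(5, 10, 15)) %>%
  mutate(.outcome  = map(n, \(n) rbinom(num.trials, n, p)),
         approx.mu = map_dbl(.outcome, mean))

without.outcomes <- with.outcomes %>% select(-.outcome)

# object.size() gives an estimate of the memory each table occupies.
format(object.size(with.outcomes), units = "MB")
format(object.size(without.outcomes), units = "KB")
```

Adding a zero to `num.trials` multiplies the first figure by about ten while the second stays essentially unchanged.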


### The Solution

The solution is to

1. Prototype the code using a separate list column of outcomes,
2. Verify the correctness of your code, then
3. Compose the steps into a single function.

## The "shape" of a composed parametric simulation

1. Set up the parameter space using `parameters`.
2. Use one `map` or `pmap` to perform the entire process, usually by piping each intermediate result into the next function.

In [11]:
# First prototype WITH the outcome column
p <- 0.5
num.trials <- 10
(parameters(~n,
            c(5,10,15))
 %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, p)),   # Easy to
            approx.mu = map_dbl(.outcome, mean),                # verify
            exact.mu = n*p                                      # correctness
           )
 )

n,.outcome,approx.mu,exact.mu
<dbl>,<list>,<dbl>,<dbl>
5,"1, 2, 3, 4, 2, 3, 5, 3, 4, 2",2.9,2.5
10,"7, 2, 6, 6, 5, 8, 4, 4, 6, 6",5.4,5.0
15,"7, 6, 8, 5, 8, 10, 7, 4, 4, 8",6.7,7.5


In [12]:
# Second ... Compose!
num.trials <- 100
(parameters(~n,
            c(5,10,15))
 %>% mutate(approx.mu = map_dbl(n, \(n) (rbinom(num.trials, n, 0.5)
                                        %>% mean)),
            exact.mu = n*p
           )
)

n,approx.mu,exact.mu
<dbl>,<dbl>,<dbl>
5,2.25,2.5
10,5.32,5.0
15,7.53,7.5
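
One way to confirm that composing did not change the computation is to run both versions from the same random seed and compare the results. This is a sketch under assumptions: a plain `tibble` stands in for `parameters`, and the seed value is arbitrary. Both versions call `rbinom` once per row in the same order, and `mean` consumes no random numbers, so the estimates should match exactly.

```r
library(dplyr)
library(purrr)

p <- 0.5
num.trials <- 100

# Prototype: keep the raw trials, summarize, then drop them.
set.seed(2025)
prototype <- tibble(n = c(5, 10, 15)) %>%
  mutate(.outcome  = map(n, \(n) rbinom(num.trials, n, p)),
         approx.mu = map_dbl(.outcome, mean)) %>%
  select(-.outcome)

# Composed: generate and summarize inside a single map_dbl.
set.seed(2025)
composed <- tibble(n = c(5, 10, 15)) %>%
  mutate(approx.mu = map_dbl(n, \(n) rbinom(num.trials, n, p) %>% mean))

identical(prototype$approx.mu, composed$approx.mu)
```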


### When can you compose?

You can compose whenever an intermediate result is saved and then used only once, as the input to the very next step.

#### Tracking the data

Here the raw trials are saved as `.outcome`, which is then used as input on the next mutate.

In [13]:
num.trials <- 10
(parameters(~n,
            c(5,10,15))# ───────┐
 %>% mutate(.outcome = map(n, \(n) rbinom(num.trials, n, 0.5)),
            # │  └─────────────────────┘
            # └────────────────────┐
            approx.mu = map_dbl(.outcome, mean),
            #   │                  └──────┘ │
            #   └───────────────────────────┘
            exact.mu = n*p
           )
 %>% select(-.outcome)
 )

n,approx.mu,exact.mu
<dbl>,<dbl>,<dbl>
5,2.7,2.5
10,3.7,5.0
15,7.0,7.5


#### Tracking the data after composing

Instead of saving the intermediate result (i.e., the raw output) we can simply pipe it into the next action.

In [14]:
# Compose!
num.trials <- 100
(parameters(~n,
            c(5,10,15))# ────────────┐
 %>% mutate(approx.mu = map_dbl(n, \(n) (rbinom(num.trials, n, 0.5)
          #   │                              │
                                        %>% mean)),
          #   └──────────────────────────────┘
            exact.mu = n*p
           )
 )


n,approx.mu,exact.mu
<dbl>,<dbl>,<dbl>
5,2.63,2.5
10,4.89,5.0
15,7.63,7.5
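
When the raw trials feed more than one summary (for example, both the mean and the variance from the motivating problem), they are no longer used just once, so a single pipe into `mean` is not enough. One possible workaround, sketched below, is to compose the steps into a small function that draws the trials once and returns every summary. The helper `summarize_binomial` and the `stats` column are hypothetical names, and a plain `tibble` stands in for `parameters`.

```r
library(dplyr)
library(purrr)
library(tidyr)

p <- 0.5
num.trials <- 1000

# Hypothetical helper: draws the trials once and returns both summaries
# as a one-row tibble. Note that var() is the sample variance.
summarize_binomial <- function(n) {
  x <- rbinom(num.trials, n, p)
  tibble(approx.mu = mean(x), approx.var = var(x))
}

results <- tibble(n = c(5, 10, 15)) %>%
  mutate(stats = map(n, summarize_binomial)) %>%
  unnest(stats) %>%                       # spread the one-row tibbles into columns
  mutate(exact.mu  = n * p,
         exact.var = n * p * (1 - p))

results
```

The raw trials exist only inside the helper, so the table never holds more than a few numbers per row.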


### <font color="red"> Exercise 2.5.1 </font>

Suppose that we want to compare the cutoff for the top 5% of binomial data with $n = 100$ and $p\in\{0.25, 0.5, 0.75\}$.

**Tasks.**
1. Prototype the process by storing the raw outcomes in a list column, then processing with additional `map`s,
2. Verify the correctness of your code, and
3. Compose all the intermediate steps to eliminate the need to store the raw outcomes.

In [None]:
# Your code here.

## Parametric Simulations with more than one parameter

A similar process can be used to simulate a scenario where two or more parameters vary across the experiment.  Depending on the scenario, we will
1. **Two parameters.** Use either `map2` or `pmap` to generate the raw outcomes.
2. **Three+ parameters.** Use `pmap` to generate the raw outcomes.
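
As an illustration of the three-parameter case, the sketch below varies all three parameters of a hypergeometric random variable. The hypergeometric distribution is chosen only because `rhyper(nn, m, n, k)` takes three parameters (`m` white balls, `n` black balls, `k` draws) and has the simple mean $k\,m/(m+n)$; the parameter values are arbitrary, and `tidyr::expand_grid` is assumed as a stand-in for `parameters`.

```r
library(dplyr)
library(purrr)
library(tidyr)

num.trials <- 10000

# Cross the three parameters, then let pmap_dbl hand one row at a time
# to the anonymous function (matched by position: m, n, k).
results <- expand_grid(m = c(5, 10), n = c(10, 20), k = c(3, 5)) %>%
  mutate(approx.mu = pmap_dbl(list(m, n, k),
                              \(m, n, k) mean(rhyper(num.trials, m, n, k))),
         exact.mu  = k * m / (m + n))

results
```

With only two parameters, the same pattern works with `map2_dbl(n, p, ...)`; `pmap` is the version that scales to any number of columns.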

### Example 2 - Investigate the mean part 2

Now suppose we want to investigate the mean of the binomial distribution for all combinations of $n\in\{5,10,15\}$ and $p\in\{0.25, 0.5, 0.75\}$.  

#### Step 1.  Define the parameter space

In [15]:
# Define the parameter space
parameters(~n,         ~p,
           c(5,10,15), c(0.25, 0.5, 0.75))

n,p
<dbl>,<dbl>
5,0.25
5,0.5
5,0.75
10,0.25
10,0.5
10,0.75
15,0.25
15,0.5
15,0.75
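
The table above contains every combination of the two inputs, with `n` varying slowest. For readers without `purrrfect` installed, `tidyr::expand_grid` appears to produce the same layout; this is an assumption based on the output shown in this notebook, not on the `purrrfect` documentation.

```r
library(tidyr)

# Cross every n with every p: 3 x 3 = 9 rows, with n varying slowest.
grid <- expand_grid(n = c(5, 10, 15),
                    p = c(0.25, 0.5, 0.75))
grid
```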


#### Step 2.  Generate outcomes with `map2`

In [16]:
# Version 1 - Use map2 to generate the data
num.trials <- 10
(parameters(~n,         ~p,
            c(5,10,15), c(0.25, 0.5, 0.75))
 %>% mutate(.outcome = map2(n, p, \(n, p) rbinom(num.trials, n, p)))
 )

n,p,.outcome
<dbl>,<dbl>,<list>
5,0.25,"1, 1, 0, 1, 0, 0, 2, 0, 1, 0"
5,0.5,"3, 2, 2, 0, 2, 2, 1, 4, 3, 2"
5,0.75,"4, 3, 4, 3, 4, 4, 3, 5, 5, 5"
10,0.25,"5, 2, 1, 2, 2, 1, 3, 3, 5, 1"
10,0.5,"2, 7, 2, 5, 4, 6, 8, 5, 3, 6"
10,0.75,"8, 9, 9, 7, 9, 6, 8, 7, 4, 7"
15,0.25,"3, 3, 2, 4, 2, 3, 4, 4, 4, 9"
15,0.5,"9, 9, 9, 9, 3, 4, 7, 9, 9, 9"
15,0.75,"9, 11, 11, 13, 10, 13, 10, 9, 11, 11"


#### Step 2 (alternative). Generate outcomes with `pmap`.

In [17]:
# Version 2 - Use pmap to generate the data
num.trials <- 10
(parameters(~n,         ~p,
            c(5,10,15), c(0.25, 0.5, 0.75))
 %>% mutate(.outcome = pmap(list(size = n, prob = p), rbinom, n = num.trials))
 )

n,p,.outcome
<dbl>,<dbl>,<list>
5,0.25,"2, 3, 1, 2, 1, 2, 3, 1, 1, 2"
5,0.5,"1, 2, 5, 3, 3, 2, 1, 2, 2, 5"
5,0.75,"3, 4, 5, 4, 3, 4, 3, 4, 4, 4"
10,0.25,"2, 2, 3, 3, 1, 3, 5, 2, 4, 3"
10,0.5,"5, 7, 5, 4, 4, 4, 3, 6, 4, 4"
10,0.75,"7, 6, 10, 9, 9, 8, 8, 9, 10, 7"
15,0.25,"4, 4, 4, 7, 3, 3, 8, 2, 6, 3"
15,0.5,"12, 9, 5, 7, 7, 7, 10, 7, 5, 5"
15,0.75,"12, 10, 10, 9, 13, 12, 11, 10, 11, 7"
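
The `pmap` call above works because the names in the list (`size`, `prob`) match the argument names of `rbinom`, while the constant `n = num.trials` is forwarded to `rbinom` as an extra argument. The sketch below compares that style with an anonymous function that spells out where each column goes. It assumes `tidyr::expand_grid` as a stand-in for `parameters`; the seed and the names `by.name` and `by.lambda` are illustrative.

```r
library(dplyr)
library(purrr)
library(tidyr)

num.trials <- 10
grid <- expand_grid(n = c(5, 10, 15), p = c(0.25, 0.5, 0.75))

# Style 1: list names match rbinom's `size` and `prob` arguments, and the
# constant `n = num.trials` is passed along through pmap's `...`.
set.seed(7)  # arbitrary seed so the two styles draw the same numbers
by.name <- grid %>%
  mutate(.outcome = pmap(list(size = n, prob = p), rbinom, n = num.trials))

# Style 2: an anonymous function makes the argument matching explicit.
set.seed(7)
by.lambda <- grid %>%
  mutate(.outcome = pmap(list(n, p),
                         \(n, p) rbinom(num.trials, size = n, prob = p)))

identical(by.name$.outcome, by.lambda$.outcome)
```

The second style is longer but avoids the clash between the column `n` (the binomial size) and `rbinom`'s own `n` argument (the number of draws), which is easy to misread in the first.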


#### Step 3. Process and summarize.

In [18]:
num.trials <- 10
(parameters(~n,         ~p,
            c(5,10,15), c(0.25, 0.5, 0.75))
 %>% mutate(.outcome = map2(n, p, \(n, p) rbinom(num.trials, n, p)))
 %>% mutate(approx.mu = map_dbl(.outcome, mean),
            exact.mu = n*p)
 )

n,p,.outcome,approx.mu,exact.mu
<dbl>,<dbl>,<list>,<dbl>,<dbl>
5,0.25,"0, 1, 1, 0, 1, 0, 1, 2, 1, 3",1.0,1.25
5,0.5,"3, 0, 1, 2, 2, 4, 1, 0, 4, 2",1.9,2.5
5,0.75,"4, 4, 4, 5, 3, 5, 4, 4, 3, 3",3.9,3.75
10,0.25,"3, 3, 2, 2, 0, 0, 2, 5, 2, 2",2.1,2.5
10,0.5,"2, 5, 4, 4, 4, 4, 7, 4, 4, 5",4.3,5.0
10,0.75,"6, 9, 8, 7, 9, 9, 6, 6, 7, 5",7.2,7.5
15,0.25,"5, 6, 3, 4, 3, 1, 3, 5, 4, 3",3.7,3.75
15,0.5,"6, 12, 7, 10, 7, 6, 7, 9, 8, 7",7.9,7.5
15,0.75,"10, 12, 9, 12, 10, 9, 15, 14, 11, 10",11.2,11.25


#### Step 4. Drop outcomes and bump the number of trials.

In [19]:
num.trials <- 100000
(parameters(~n,         ~p,
            c(5,10,15), c(0.25, 0.5, 0.75))
 %>% mutate(.outcome = map2(n, p, \(n, p) rbinom(num.trials, n, p)))
 %>% mutate(approx.mu = map_dbl(.outcome, mean),
            exact.mu = n*p)
 %>% select(-.outcome)
 )

n,p,approx.mu,exact.mu
<dbl>,<dbl>,<dbl>,<dbl>
5,0.25,1.25207,1.25
5,0.5,2.5054,2.5
5,0.75,3.75656,3.75
10,0.25,2.49736,2.5
10,0.5,4.99906,5.0
10,0.75,7.49287,7.5
15,0.25,3.75132,3.75
15,0.5,7.50058,7.5
15,0.75,11.25092,11.25


#### Step 5. Compose into one map.

In [20]:
num.trials <- 100000
(parameters(~n,         ~p,
            c(5,10,15), c(0.25, 0.5, 0.75))
# The prototype steps below are replaced by one composed map2_dbl:
#  %>% mutate(.outcome = map2(n, p, \(n, p) rbinom(num.trials, n, p)))
#  %>% mutate(approx.mu = map_dbl(.outcome, mean),
#             exact.mu = n*p)
#  %>% select(-.outcome)
 %>% mutate(approx.mu = map2_dbl(n, p, \(n, p) rbinom(num.trials, n, p) %>% mean),
            exact.mu = n*p)
 )

n,p,approx.mu,exact.mu
<dbl>,<dbl>,<dbl>,<dbl>
5,0.25,1.2478,1.25
5,0.5,2.49818,2.5
5,0.75,3.74908,3.75
10,0.25,2.50165,2.5
10,0.5,4.99959,5.0
10,0.75,7.49168,7.5
15,0.25,3.75373,3.75
15,0.5,7.49284,7.5
15,0.75,11.24194,11.25


### <font color="red"> Exercise 2.5.2 </font>

Suppose that we want to compare the cutoff for the top 5% of binomial data with $n\in\{25, 50, 100\}$ and $p\in\{0.25, 0.5, 0.75\}$.

**Tasks.**
1. Prototype the process by storing the raw outcomes in a list column, then processing with additional `pmap` or `map2` calls,
2. Verify the correctness of your code, and
3. Compose all the intermediate steps to eliminate the need to store the raw outcomes.

In [None]:
# Your code here