<a href="https://colab.research.google.com/github/yardsale8/probability_simulations_in_R/blob/main/1_5_transforming_compound_outcomes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
library(dplyr)
library(tidyr)
library(purrr)
library(devtools)
install_github('yardsale8/purrrfect', force = TRUE)
library(purrrfect)


## Review -  Levels of Abstraction of a list column

<img src="https://github.com/yardsale8/probability_simulations_in_R/blob/main/img/1_3_levels_of_abstraction.png?raw=true" width="600"/>

A list column is an example of a nested data structure, that is, a data structure that contains another data structure.  In the case shown above, we have
1. A dataframe containing
2. A list `.outcome` column containing
3. Integer vectors containing
4. Raw integers.

Each of these represents a level of abstraction.
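
The four levels can be explored one at a time. The sketch below is only an illustration: it builds a small table by hand with `tibble` (the values are made up, not simulated) and then steps down through each level.

```r
library(tibble)

# Hypothetical data: three trials, each recording two rolls of a 20-sided die.
trials <- tibble(
  .trial   = 1:3,
  .outcome = list(c(1L, 17L), c(20L, 5L), c(17L, 8L))
)

str(trials)                # level 1: the dataframe
class(trials$.outcome)     # level 2: the list column, "list"
trials$.outcome[[1]]       # level 3: one integer vector, 1 17
trials$.outcome[[1]][2]    # level 4: one raw integer, 17
```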

### Review -  the `str` output for a list column

<img src="https://github.com/yardsale8/probability_simulations_in_R/blob/main/img/1_3_str_shows_levels.png?raw=true" width="600">

## Three Types of Problems

Many probability problems can be classified as one of the following
1. Questions about a <font color="orange">**summary statistic**</font> like the sum, mean, or max,
2. Questions about a <font color="orange">**count of successes**</font> like counting the number of heads, or
3. Questions about the <font color="orange">**order/sequence**</font> of events.

We will illustrate methods for solving each type by either reshaping the outcomes or using `mutate` and `map`, but before we start you should practice classifying problems as one of these three types.

### <font color="red"> Exercise 1.5.1 - Classifying problems</font>

You roll a regular six-sided die twice and record the pair of outcomes.  For each of the following questions, classify the problem as related to either (A) a summary statistic, (B) counting successes, or (C) the sequence of events.
* Event A that at least one of the rolls is 5.
* Event B that the maximum of the values rolled is 3.
* Event C that the first roll is larger than the second.


<font color="orange">
Your classifications here
</font>

## Solutions Based on Reshaping Outcomes

First, we illustrate solving these problems by reshaping the outcomes, that is
* *splitting* the outcomes across many simple columns, or
* *stacking* the outcomes in separate rows.
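
To get a feel for the two shapes, here is a rough sketch using `tidyr` on a hand-built table. It is an analogy only: we are assuming that stacking resembles `unnest_longer` and splitting resembles `unnest_wider`, and this is not a description of how `purrrfect` implements `.reshape` (for example, the `.replication` column is not produced here).

```r
library(tibble)
library(tidyr)

# Hypothetical outcomes: two trials of two rolls each.
outcomes <- tibble(
  .trial   = 1:2,
  .outcome = list(c(16L, 9L), c(2L, 12L))
)

# Stacking: one row per roll, so 2 trials x 2 rolls = 4 rows.
stacked <- unnest_longer(outcomes, .outcome)

# Splitting: one column per roll, named .outcome.1 and .outcome.2.
split.tbl <- unnest_wider(outcomes, .outcome, names_sep = ".")

stacked
split.tbl
```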

### Solving each problem type via reshaping

Here is a sketch of how to solve each problem type via reshaping.
1. **Summary statistic.** Stack $\longrightarrow$ group & summarize.
2. **Count of successes.** Stack $\longrightarrow$ recode as 1 or 0 $\longrightarrow$  group & summarize(sum).
3. **Order/sequence.** Split $\longrightarrow$ mutate.

Let's illustrate by solving three problems about rolling a die, one of each type.

### Example - Rolling a fair 20-sided die.

Suppose that we are rolling a 20-sided die twice and want to know the probability that
1. the average is larger than 12 (summary statistic),
2. at least one roll is larger than 15 (counting successes), and
3. the two rolls are the same (order).

Below, we will solve each question by reshaping the outcomes.
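
Because two rolls of a fair 20-sided die have only 400 equally likely outcomes, the exact answers can be found by listing them all. The base R sketch below is not part of the simulation workflow; it is offered only as a reference point for checking the simulated estimates once `N` is large. The variable names are our own.

```r
# All 400 equally likely (first, second) pairs.
omega <- expand.grid(first = 1:20, second = 1:20)

# The mean of a logical vector is the proportion of TRUE values.
p.avg.greater.12        <- mean((omega$first + omega$second) / 2 > 12)  # 0.34
p.at.least.one.above.15 <- mean(omega$first > 15 | omega$second > 15)   # 0.4375
p.rolls.equal           <- mean(omega$first == omega$second)            # 0.05
```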

#### Problem 1 - The probability that the average is larger than 12 (summary statistic)

**Strategy.** Stack $\longrightarrow$ group & summarize.

Uncomment each line, rerun the code, and inspect.

In [None]:
N <- 10
( replicate(N, sample(1:20, 2, replace=TRUE), .reshape = 'stack') # Stack
  # %>% group_by(.trial) %>% summarise(avg = mean(.outcome)) # Group and summarize
  # %>% mutate(mean.greater.12 = avg > 12) # Boolean event column
  # %>% estimate_all_prob # Increase N, uncomment, and estimate
)

.trial,.replication,.outcome
<dbl>,<int>,<int>
1,1,1
1,2,17
2,1,20
2,2,5
3,1,17
3,2,8
4,1,20
4,2,4
5,1,11
5,2,2


#### Problem 2 - The probability that at least one roll is greater than 15 (counting successes)

**Strategy.** Stack $\longrightarrow$ recode as 1 or 0 $\longrightarrow$  group & summarize(sum).

Uncomment each line, rerun the code, and inspect.

In [None]:
N <- 10
( replicate(N, sample(1:20, 2, replace=TRUE), .reshape = 'stack') # Stack
  %>% mutate(is.success = ifelse(.outcome > 15, 1, 0)) # Recode successes (strictly greater than 15)
  # %>% group_by(.trial) %>% summarise(num.successes = sum(is.success)) # Group and summarize
  # %>% mutate(at.least.one.greater.15 = num.successes >= 1) # Boolean event column
  # %>% estimate_all_prob # Increase N, uncomment, and estimate
)


.trial,.replication,.outcome,is.success
<dbl>,<int>,<int>,<dbl>
1,1,9,0
1,2,10,0
2,1,16,1
2,2,13,0
3,1,1,0
3,2,20,1
4,1,17,1
4,2,12,0
5,1,16,1
5,2,18,1


#### Problem 3 - The probability that the two rolls are the same (order/sequence)

**Strategy.** Split $\longrightarrow$ mutate.

Uncomment the last line, rerun, and inspect the results.

In [None]:
N <- 10
( replicate(N, sample(1:20, 2, replace=TRUE), .reshape = 'split') # Split
  %>% mutate(rolls.equal = .outcome.1 == .outcome.2) # Mutate
  # %>% estimate_all_prob # Increase N, uncomment, and estimate
)

.trial,.outcome.1,.outcome.2,rolls.equal
<dbl>,<int>,<int>,<lgl>
1,16,9,False
2,2,12,False
3,19,11,False
4,7,4,False
5,10,15,False
6,6,20,False
7,12,8,False
8,18,10,False
9,1,17,False
10,16,9,False


### Problem types and random variables

Recall that a random variable is a function that maps each outcome, denoted $\omega$, to a number, denoted $X(\omega)$.  Each of the previous techniques can be thought of as computing one or more random variables.
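
One way to make this concrete is to write each random variable as an ordinary R function of a single outcome. The function names below are our own choices for illustration, and `omega` is a made-up pair of rolls.

```r
# One outcome: the pair of rolls from a single trial (hypothetical values).
omega <- c(18L, 9L)

X.avg   <- function(omega) mean(omega)        # Problem 1: the summary statistic
X.count <- function(omega) sum(omega > 15)    # Problem 2: the number of successes
X.1     <- function(omega) omega[1]           # Problem 3: the first roll
X.2     <- function(omega) omega[2]           # Problem 3: the second roll

X.avg(omega)                # 13.5
X.count(omega)              # 1
X.1(omega) == X.2(omega)    # FALSE
```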

#### Problem 1 - $X$ is the summary statistic

In [None]:
N <- 10
( replicate(N, sample(1:20, 2, replace=TRUE), .reshape = 'stack') # Stack
  %>% group_by(.trial) %>% summarise(avg = mean(.outcome)) # Group and summarize
)

.trial,avg
<dbl>,<dbl>
1,7.0
2,4.0
3,18.5
4,16.0
5,10.5
6,6.5
7,7.5
8,16.0
9,10.5
10,17.0


#### Problem 2 - $X$ is the number of successes

In [None]:
N <- 10
( replicate(N, sample(1:20, 2, replace=TRUE), .reshape = 'stack') # Stack
  %>% mutate(is.success = ifelse(.outcome > 15, 1, 0)) # Recode successes (strictly greater than 15)
  %>% group_by(.trial) %>% summarise(num.successes = sum(is.success)) # Group and summarize
)


.trial,num.successes
<dbl>,<dbl>
1,1
2,1
3,2
4,0
5,1
6,1
7,1
8,0
9,0
10,1


#### Problem 3 - $X_1$ and $X_2$ are the first and second rolls.

In [None]:
N <- 10
( replicate(N, sample(1:20, 2, replace=TRUE), .reshape = 'split') # Split
)


.trial,.outcome.1,.outcome.2
<dbl>,<int>,<int>
1,18,9
2,12,8
3,3,7
4,4,9
5,11,13
6,13,14
7,16,6
8,4,5
9,14,7
10,9,8


### <font color="red"> Exercise 1.5.2 - Reshaping outcomes</font>

You roll a regular six-sided die twice and record the pair of outcomes.  For each of the following questions, classify the problem as related to either (A) a summary statistic, (B) counting successes, or (C) the sequence of events.
* Event A that at least one of the rolls is 5.
* Event B that the maximum of the values rolled is 3.
* Event C that the first roll is larger than the second.

**Tasks.**

1. Estimate the probability of each event by reshaping the outcomes, then
2. Define the random variables used in each solution.

In [None]:
# Your code here.

<font color="orange">
Identify the random variables here.
</font>

## Using `mutate` + `map` on Compound Simulations

The other approach to solving the three types of problems involves using `mutate` and `map` to work directly with the outcomes in the list column.



### Piercing levels of abstraction with `mutate` and `map`
<img src="https://github.com/yardsale8/probability_simulations_in_R/blob/main/img/1_3_mutate_map_and_levels.png?raw=true" width="600">

The functions `mutate` and `map` provide the tools needed to reach into a table and transform data at various levels.
- `mutate(col1 = v(col1))` applies the vectorized function `v` to the whole column `col1`.
- `mutate(col1 = map(col1, v))` applies the vectorized function `v` to each element of `col1`.
- `mutate(col1 = map(col1, \(x) map(x, f)))` applies the function `f` to each element of each list in `col1`.
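
The three patterns can be compared side by side on a small hand-built table. This is a sketch with made-up values; the new column names (`whole.col`, `per.vector`, `per.element`) are ours and only label which level each call reaches.

```r
library(tibble)
library(dplyr)
library(purrr)

# Hypothetical outcomes: two trials of two rolls each.
outcomes <- tibble(
  .trial   = 1:2,
  .outcome = list(c(16L, 9L), c(2L, 12L))
)

result <- outcomes %>%
  mutate(
    # Whole column: length of the list column itself (2 rows), recycled.
    whole.col   = length(.outcome),
    # Each element of the column: length of each integer vector.
    per.vector  = map_int(.outcome, length),
    # Each element of each vector: is this single roll greater than 15?
    per.element = map(.outcome, \(x) map_lgl(x, \(r) r > 15))
  )

result
```

In practice the innermost `map_lgl` is rarely needed here, since `x > 15` is already vectorized; it is written out only to show the third level.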


### A `mutate` + `map` strategy for each problem type.

Here is a sketch of how to solve each problem type using `mutate` + `map`.
1. **Summary statistic.** Map the summary statistic onto the list column of outcomes.
2. **Count of successes.** Use `map` to recode as 1 or 0 $\longrightarrow$ map `sum` onto recoded values.
3. **Order/sequence.** Use `map` to extract important information.

Let's illustrate by returning to the three problems about rolling a die, one of each type.



### Example - Rolling a fair 20-sided die.

Recall that we are rolling a 20-sided die twice and want to know the probability that
1. the average is larger than 12 (summary statistic),
2. at least one roll is larger than 15 (counting successes), and
3. the two rolls are the same (order).

Below, we will solve each question by using `mutate` and `map` on the list column of outcomes.

#### Problem 1 - Map the average onto outcomes

In [None]:
N <- 10
(replicate(N, sample(1:20, 2, replace=TRUE))
%>% mutate(avg = map_dbl(.outcome, mean))
%>% mutate(avg.larger.12 = avg > 12)
# %>% estimate_all_prob
)

.trial,.outcome,avg,avg.larger.12
<dbl>,<list>,<dbl>,<lgl>
1,"20, 16",18.0,True
2,"1, 7",4.0,False
3,"20, 3",11.5,False
4,"18, 4",11.0,False
5,"7, 17",12.0,False
6,"17, 12",14.5,True
7,"16, 4",10.0,False
8,"19, 3",11.0,False
9,"3, 13",8.0,False
10,"20, 10",15.0,True


#### Problem 2 - Recode then sum

In [None]:
N <- 10
(replicate(N, sample(1:20, 2, replace=TRUE))
%>% mutate(is.success = map(.outcome, \(x) ifelse(x > 15, 1, 0))) # Recode (strictly greater than 15)
%>% mutate(num.successes = map_int(is.success, sum))
%>% mutate(at.least.one.15 = num.successes >= 1)
# %>% estimate_all_prob
)

.trial,.outcome,is.success,num.successes,at.least.one.15
<dbl>,<list>,<list>,<int>,<lgl>
1,"3, 8","0, 0",0,False
2,"11, 10","0, 0",0,False
3,"20, 7","1, 0",1,True
4,"4, 11","0, 0",0,False
5,"6, 10","0, 0",0,False
6,"13, 5","0, 0",0,False
7,"5, 6","0, 0",0,False
8,"20, 1","1, 0",1,True
9,"9, 20","0, 1",1,True
10,"9, 15","0, 0",0,False


### Review - Accessing vector and list entries.

Before we solve the third problem, which involves the order/sequence/position of the outcomes, we need to review methods of accessing elements of vectors and lists.


In [None]:
(v <- 1:6)

In [None]:
(l <- list(a = 1, b = 2))

#### Accessing a vector element by position using `v[k]`

In [None]:
v[2]

#### Subsetting a list by position using `l[k]` (returns a one-element list)

In [None]:
l[1]

#### Accessing a list element by name using `l[[name]]`

In [None]:
l[['a']]
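
One detail worth checking for yourself is the difference between single and double brackets on a list. The base R sketch below redefines the same `l` so that it runs on its own.

```r
l <- list(a = 1, b = 2)

sub <- l[1]       # single brackets: a list of length 1, still named "a"
el  <- l[[1]]     # double brackets: the element itself, the number 1

is.list(sub)                  # TRUE
is.list(el)                   # FALSE
identical(l[["a"]], l[[1]])   # TRUE: by name and by position give the same element
```

For an atomic vector such as `v <- 1:6`, `v[2]` already returns the value, which is why single brackets are enough when working with the integer vectors inside `.outcome`.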

#### Problem 3 - `mutate` + `map` while accessing relevant parts.


**Strategy.** Use `mutate` + `map` to extract important information by position, as shown above.



In [None]:
N <- 10
(replicate(N, sample(1:20, 2, replace=TRUE))
%>% mutate(is.equal = map_lgl(.outcome, \(v) v[1] == v[2])) # Extract position ==> get Boolean
# %>% estimate_all_prob
)

.trial,.outcome,is.equal
<dbl>,<list>,<lgl>
1,"2, 11",False
2,"16, 16",True
3,"16, 17",False
4,"3, 2",False
5,"3, 16",False
6,"16, 13",False
7,"1, 16",False
8,"8, 7",False
9,"6, 15",False
10,"5, 3",False
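
With only 10 trials the estimate is very noisy. The sketch below repeats the same idea with many more trials, using plain `tibble`, `dplyr`, and `purrr` rather than the `replicate` and `estimate_all_prob` helpers used above. It assumes that estimating a probability amounts to taking the mean of the Boolean event column; the seed and the value of `N` are arbitrary choices.

```r
library(tibble)
library(dplyr)
library(purrr)

set.seed(2024)   # arbitrary seed, only for reproducibility
N <- 10000

sims <- tibble(.trial = 1:N) %>%
  # One list-column entry per trial: two rolls of a 20-sided die.
  mutate(.outcome = map(.trial, \(i) sample(1:20, 2, replace = TRUE))) %>%
  # Extract by position and compare, giving one Boolean per trial.
  mutate(is.equal = map_lgl(.outcome, \(v) v[1] == v[2]))

# The mean of a logical column is the proportion of TRUE values.
estimate <- sims %>% summarise(p.equal = mean(is.equal))
estimate
```

The estimate should land near the exact value $20/400 = 0.05$, since 20 of the 400 equally likely pairs have matching rolls.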


### <font color="red"> Exercise 1.5.3 - `mutate` + `map`</font>

You roll a regular six-sided die twice and record the pair of outcomes.  For each of the following questions, classify the problem as related to either (A) a summary statistic, (B) counting successes, or (C) the sequence of events.
* Event A that at least one of the rolls is 5.
* Event B that the maximum of the values rolled is 3.
* Event C that the first roll is larger than the second.

**Task.** Estimate the probability of each event using `mutate` + `map` on the outcomes.

In [None]:
# Your code here