
In [2]:
library(dplyr)
library(tidyr)
library(purrr)
library(devtools)

In [3]:
install_github('yardsale8/purrrfect', force = TRUE)

In [4]:
library(purrrfect)


Attaching package: ‘purrrfect’


The following objects are masked from ‘package:base’:

    replicate, tabulate




# Introduction to Simulations in `R`

**Outline**
1. Simulating Bernoulli trials using sampling with replacement.<br>
  a. Review of Bernoulli trials<br>
  b. Using `replicate` and `sample` to simulate Bernoulli trials<br>
2. Simulating the classic urn problem using sampling without replacement<br>
  a. Review of the urn problem<br>
  b. Using `replicate` and `sample` to simulate draws from an urn<br>
3. Basic techniques for estimating probabilities<br>
  a. Using `col_num_successes` to compute the number of successes.<br>
  b. Solving location-based questions by splitting the outcomes<br>
  c. Estimating/tabulating the estimated probabilities.

## Review of Bernoulli Trials

Bernoulli trials have
* independent trials
* binary outcomes: success and failure
* fixed $P(\text{success})$





Examples
* Tossing a fair or biased coin
* Rolling a fair or biased die, where one face (or a set of faces) is designated a success
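
A die roll has more than two outcomes, so it only becomes a Bernoulli trial once we decide which faces count as a success. Here is a minimal base R sketch, assuming (purely for illustration) that rolling a six is the success. The seed and the names `rolls` and `is_success` are not part of the original notebook.

```r
set.seed(1)  # arbitrary seed, used only so the sketch is reproducible

# Ten rolls of a fair six-sided die
rolls <- sample(1:6, 10, replace = TRUE)

# Hypothetical success rule: a roll of six is a success, anything else a failure
is_success <- rolls == 6

is_success        # one TRUE/FALSE per trial
mean(is_success)  # observed proportion of successes
```

Each entry of `is_success` is a binary outcome with the same success probability (1/6 for a fair die), which is exactly the Bernoulli setup listed above.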

## Simulating Bernoulli trials using `replicate` and `sample`

The simplest way to simulate a Bernoulli trial is by sampling.  To do this we

1. Define a vector to use as a sample space,
2. (Optional) Define a vector of probabilities when outcomes are not equally likely, and
3. Write and test the expression for a single trial using `sample`.  Be sure to use `replace = TRUE`.
4. Use `replicate` on this expression to get a number of trials.<br>
   a. Prototype your code with a small number of trials. <br>
   b. Once your code is working/verified, switch to a large number of trials to make estimates.

### Example 1 - Toss a fair coin

Suppose we want to toss a fair coin three times.

In [5]:
coin <- c('H', 'T')
coin

In [6]:
sample(coin, 3, replace = TRUE)

In [7]:
trials <- replicate(10, sample(coin, 3, replace = TRUE))
trials

.trial,.outcome
<dbl>,<list>
1,"H, H, H"
2,"T, T, H"
3,"T, T, T"
4,"H, T, H"
5,"T, H, T"
6,"H, T, H"
7,"H, T, H"
8,"T, T, T"
9,"H, T, H"
10,"T, T, H"


#### Task - Debug the following code and ensure the resulting trials are displayed.

In [None]:
N <- 10 # small to prototype
trials <- replicate(N, sample(coin, 3))


### Example 2 - Toss a biased coin

Suppose we have a coin that comes up heads 65% of the time and we simulate tossing this coin four times. To simulate this scenario, we would need to either
1. Create a sample space with 65% heads and 35% tails, or
2. Use a vector of probabilities.

Let's look at both approaches.

#### Approach 1 - Create a proportional sample space.

Note that we can use the `rep` function to replicate the heads and tails.

In [10]:
coin_prop <- c(rep('H', 65), rep('T', 35))
coin_prop

We can use `tabulate` to verify the sample space has the correct proportions.

In [13]:
data.frame(x = coin_prop) %>% tabulate(x)

X = H,X = T
<dbl>,<dbl>
0.65,0.35
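
As a cross-check that does not depend on `purrrfect`, the same proportions can be computed with base R's `table` and `prop.table`. This is a sketch of an alternative check, not the notebook's own method, and `props` is a name introduced here for illustration.

```r
# Rebuild the proportional sample space so the sketch is self-contained
coin_prop <- c(rep('H', 65), rep('T', 35))

# Count each outcome, then convert the counts to proportions
props <- prop.table(table(coin_prop))
props
```

The proportions come out to 0.65 for `H` and 0.35 for `T`, matching the table above.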


In [14]:
sample(coin_prop, 4, replace=TRUE)

In [15]:
replicate(10, sample(coin_prop, 4, replace=TRUE))

.trial,.outcome
<dbl>,<list>
1,"H, T, T, H"
2,"H, T, H, H"
3,"T, H, H, T"
4,"H, H, H, H"
5,"T, T, H, H"
6,"H, H, H, H"
7,"T, T, H, H"
8,"H, H, T, T"
9,"H, H, T, H"
10,"H, H, H, T"


#### Approach 2 - Use a vector of probabilities.

We start by defining the collection of unique outcomes.

In [19]:
coin <- c('H', 'T')
coin

Next, we define the respective probabilities.

In [20]:
ps <- c(0.65, 0.35)
ps

The probability vector is passed to the optional `prob` argument of `sample`.

In [21]:
sample(coin, 4, replace = TRUE, prob = ps)

Once we have an expression for one trial, we wrap this expression in `replicate`.

In [22]:
trials <- replicate(10, sample(coin, 4, replace = TRUE, prob = ps))
trials

.trial,.outcome
<dbl>,<list>
1,"H, H, H, T"
2,"H, H, H, H"
3,"T, H, H, H"
4,"T, H, T, H"
5,"H, H, T, H"
6,"H, H, H, H"
7,"T, H, H, H"
8,"T, T, H, T"
9,"T, T, H, H"
10,"T, H, H, T"
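
Both approaches should produce heads about 65% of the time in the long run. The sketch below checks this using base R only, by drawing a large number of single tosses each way and comparing the proportion of heads. The seed and the names `tosses_prop` and `tosses_prob` are illustrative and not part of the notebook.

```r
set.seed(2024)  # arbitrary seed, used only for reproducibility
N <- 100000

coin      <- c('H', 'T')
coin_prop <- c(rep('H', 65), rep('T', 35))

# Approach 1: sample from the proportional sample space
tosses_prop <- sample(coin_prop, N, replace = TRUE)

# Approach 2: sample from the unique outcomes with a probability vector
tosses_prob <- sample(coin, N, replace = TRUE, prob = c(0.65, 0.35))

# Proportion of heads under each approach
c(proportional = mean(tosses_prop == 'H'),
  prob_vector  = mean(tosses_prob == 'H'))
```

The probability-vector approach is usually more convenient when the probabilities are not simple fractions, since the proportional sample space would need a long vector to represent them.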


## <font color='red'> Exercise 1 - Rolling a 6-sided die </font>

Simulate rolling a fair 6-sided die.

In [None]:
# Your code here

## Review - The classic urn problem.

In the classic urn problem, we have
1. An urn containing a finite number of chips of two colors
2. A fixed number of chips drawn from this urn without replacement.

## Simulating the urn problem using `replicate` and `sample`

The simplest way to simulate chips drawn from an urn is also sampling.  To do this we

1. Define a vector to use as a sample space.  We can use the `rep` function to get replicates for each color.
2. Write and test the expression for a single trial using `sample`.  Make sure to use `replace = FALSE`.
3. Use `replicate` on this expression to get a number of trials.<br>
   a. Prototype your code with a small number of trials. <br>
   b. Once your code is working/verified, switch to a large number of trials to make estimates.

### Example 3

Suppose we have
* an urn with five blue chips and three white chips and
* we will be drawing two chips at random without replacement.

In [23]:
urn <- c(rep('B', 5), rep('W', 3))
urn

In [24]:
sample(urn, 2, replace = FALSE)

In [25]:
trials <- replicate(10, sample(urn, 2, replace = FALSE))
trials

.trial,.outcome
<dbl>,<list>
1,"B, B"
2,"W, B"
3,"B, W"
4,"W, W"
5,"B, W"
6,"W, B"
7,"W, B"
8,"B, B"
9,"B, B"
10,"B, W"
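
For an urn this small, the exact distribution of the number of blue chips in the two draws is hypergeometric, which gives a useful benchmark for later simulation estimates. Here is a sketch using base R's `dhyper`, with the same values recomputed from the counting formula. The names `x`, `exact`, and `by_counting` are introduced here for illustration.

```r
# x = number of blue chips among the two chips drawn
x <- 0:2

# dhyper(x, m, n, k): m = blue chips in the urn, n = white chips, k = chips drawn
exact <- dhyper(x, m = 5, n = 3, k = 2)
names(exact) <- paste0('blue=', x)
exact

# Same probabilities from counting: C(5, x) * C(3, 2 - x) / C(8, 2)
by_counting <- choose(5, x) * choose(3, 2 - x) / choose(8, 2)
by_counting
```

The exact probabilities are 3/28, 15/28, and 10/28 for zero, one, and two blue chips, respectively.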


## Tricks for estimating probabilities

Next, we need to manipulate the table of results, then estimate the probability for the question(s) of interest.  We will look at the following tricks.
1. Compute the number of successes for both simple and compound events.
2. Create a Boolean column and use `estimate_prob` to make an estimate.
3. Split the outcomes and use `tabulate` to compute a marginal distribution.
4. Split the outcomes to answer questions about the location of the outcomes within a trial.

### Guiding experiment

Suppose we want to roll a 20-sided die three times.  First, let's write the code to simulate the trials.

In [26]:
die <- 1:20
die

In [27]:
sample(die, 3, replace = TRUE)

In [28]:
replicate(10, sample(die, 3, replace = TRUE))

.trial,.outcome
<dbl>,<list>
1,"15, 7, 15"
2,"2, 17, 13"
3,"20, 15, 8"
4,"3, 1, 11"
5,"19, 2, 1"
6,"12, 12, 3"
7,"8, 16, 1"
8,"3, 10, 3"
9,"15, 4, 7"
10,"20, 5, 7"


### Trick 1.  Compute the number of successes for both simple and compound events.

1. **Simple Event.** Pipe into `col_num_successes(.outcome, value)`
2. **Compound Event.**<br>
  a. Write/save a Boolean success function, and<br>
  b. Pipe into `col_num_successes(.outcome, functions)`

#### Simple Event Example

Suppose we consider any roll of 18 a success.  Estimate the distribution of the number of successes.

**Step 1 - Use `col_num_successes` to count the number of successes.**

Since there is exactly one outcome in this event, we can simply pass this outcome to the function.

In [None]:
(replicate(10, sample(die, 3, replace = TRUE))
 %>% col_num_successes(.outcome, 18)
)

.trial,.outcome,.successes
<dbl>,<list>,<int>
1,"3, 18, 2",1
2,"9, 6, 16",0
3,"13, 3, 12",0
4,"3, 13, 14",0
5,"18, 9, 12",1
6,"12, 13, 17",0
7,"14, 20, 7",0
8,"6, 5, 7",0
9,"11, 7, 2",0
10,"14, 6, 15",0


**Step 2 - Tabulate the results.**

We use the `tabulate` function to estimate the probabilities for all outcomes in a given column.  Note that we want to prototype our code on a small number of trials, then increase the number of trials to get better estimates once we have established that our code is working.

In [29]:
# prototype with small N first
N <- 100
(replicate(N, sample(die, 3, replace = TRUE))
 %>% col_num_successes(.outcome, 18)
 %>% tabulate(.successes)
)

X = 0,X = 1
<dbl>,<dbl>
0.85,0.15


In [30]:
# Estimate with large N
N <- 100000
(replicate(N, sample(die, 3, replace = TRUE))
 %>% col_num_successes(.outcome, 18)
 %>% tabulate(.successes)
)

X = 0,X = 1,X = 2,X = 3
<dbl>,<dbl>,<dbl>,<dbl>
0.85791,0.13518,0.00681,0.0001
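
Because the three rolls are independent and each shows an 18 with probability 1/20, the number of 18s follows a binomial distribution, so the simulated proportions can be compared with exact values. Here is a small sketch using base R's `dbinom`. The name `exact` is introduced here for illustration.

```r
# Exact distribution of the number of 18s in three rolls of a fair 20-sided die
exact <- dbinom(0:3, size = 3, prob = 1 / 20)
names(exact) <- paste0('X = ', 0:3)
exact
```

The exact values are 0.857375, 0.135375, 0.007125, and 0.000125, so the large-`N` estimates above agree with them to roughly two decimal places.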


#### Compound Event Example

**Task.** Estimate the chance of getting at least one roll that is 19 or more.

Note that
1. A success is any roll of 19 or more.
2. We can compute the total number of successes and then check if this number is at least one.

**Step 1 - Define a Boolean helper function.**

Note that the set of successes has more than a single outcome, making it a compound event.  In this case, we need to define a *Boolean helper function* that returns `TRUE` for any outcome that is a success and `FALSE` otherwise.

In [32]:
# define the function
at.least.19 <- \(x) x >= 19

# Test the function
at.least.19(c(2, 18, 19, 20))

**Step 2 - Compute the number of successes.**

To compute the successes for a compound event, we pass the Boolean helper function to the `col_num_successes` function.

In [33]:
(replicate(10, sample(die, 3, replace = TRUE))
 %>% col_num_successes(.outcome, at.least.19)
)

.trial,.outcome,.successes
<dbl>,<list>,<int>
1,"16, 10, 7",0
2,"11, 8, 16",0
3,"15, 18, 4",0
4,"10, 13, 17",0
5,"2, 6, 20",1
6,"11, 1, 18",0
7,"20, 4, 18",1
8,"20, 1, 2",1
9,"6, 7, 12",0
10,"17, 4, 18",0
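
Judging from the table above, the `.successes` column holds, for each trial, the number of rolls for which the helper returns `TRUE`. Assuming that reading is right, the same count can be reproduced in base R by summing the logical vector the helper returns. The list `outcomes` below is a hypothetical stand-in for the `.outcome` column, filled with three of the trials shown above.

```r
# Boolean helper from the example: TRUE for any roll of 19 or more
at.least.19 <- \(x) x >= 19

# Hypothetical stand-in for the .outcome list column (three trials of three rolls)
outcomes <- list(c(16, 10, 7), c(2, 6, 20), c(20, 4, 18))

# Summing a logical vector counts its TRUE values, giving one count per trial
counts <- vapply(outcomes, \(rolls) sum(at.least.19(rolls)), integer(1))
counts
```

This is only meant to show the idea behind the count; it is not a description of how `col_num_successes` is implemented.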


**Step 3 - Determine whether there is at least one roll of 19 or more.**

Since our goal is to determine whether or not there is at least one roll of 19 or more, we use `mutate` to create a new Boolean column containing the answer.

In [34]:
(replicate(10, sample(die, 3, replace = TRUE))
 %>% col_num_successes(.outcome, at.least.19)
 %>% mutate(at.least.one.success = .successes >= 1)
)

.trial,.outcome,.successes,at.least.one.success
<dbl>,<list>,<int>,<lgl>
1,"6, 20, 19",2,True
2,"5, 14, 1",0,False
3,"20, 7, 15",1,True
4,"12, 1, 15",0,False
5,"14, 15, 19",1,True
6,"3, 19, 11",1,True
7,"11, 18, 2",0,False
8,"9, 18, 5",0,False
9,"20, 18, 13",1,True
10,"15, 8, 13",0,False


**Step 4 - Estimate the probability of the Boolean column.**

When computing probabilities for a Boolean column, we are generally only interested in the proportion of `TRUE` values.  We can make this estimate using the `estimate_prob` function on the column of interest.

In [35]:
# prototype first
N <- 100
(replicate(N, sample(die, 3, replace = TRUE))
 %>% col_num_successes(.outcome, at.least.19)
 %>% mutate(at.least.one.success = .successes >= 1)
 %>% estimate_prob(at.least.one.success)
)

at.least.one.success
<dbl>
0.25


In [36]:
# estimate with MANY trials
N <- 100000
(replicate(N, sample(die, 3, replace = TRUE))
 %>% col_num_successes(.outcome, at.least.19)
 %>% mutate(at.least.one.success = .successes >= 1)
 %>% estimate_prob(at.least.one.success)
)

at.least.one.success
<dbl>
0.26756
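
This estimate can be checked against the exact answer using the complement rule: the chance of at least one success equals one minus the chance of no successes. Here is a short base R sketch. The names `p_success`, `p_none`, and `p_at_least_one` are introduced here for illustration.

```r
# A single roll is 19 or 20 with probability 2/20
p_success <- 2 / 20

# All three independent rolls fail with probability (1 - p)^3
p_none <- (1 - p_success)^3

# Complement rule: P(at least one success) = 1 - P(no successes)
p_at_least_one <- 1 - p_none
p_at_least_one

# The same value from the binomial distribution: P(X > 0)
pbinom(0, size = 3, prob = p_success, lower.tail = FALSE)
```

The exact probability is 0.271, close to the large-`N` estimate above.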


## <font color="red"> Exercise 2 </font>

Suppose that we have a batch of 100 products that contain 12 defective units.  We draw three products out at random and determine whether or not they are defective.

**Questions.**
1. What is the probability that exactly one product is defective?
2. What is the probability that more than two products are defective?

In [None]:
# Your code here

### Trick 2.  Splitting the outcomes to answer questions about position.

You can set the `.reshape = 'split'` option in `replicate` to split the outcomes into separate columns, which is useful when
1. Answering questions related to order/position.
2. Tabulating a single marginal distribution.

To illustrate, we return to the earlier urn problem.

In [39]:
urn <- c(rep('B', 5), rep('W', 3))
urn

In [40]:

replicate(10, sample(urn, 2, replace = FALSE))

.trial,.outcome
<dbl>,<list>
1,"B, B"
2,"B, B"
3,"B, W"
4,"W, W"
5,"B, W"
6,"W, B"
7,"B, B"
8,"B, W"
9,"B, W"
10,"B, B"


#### Splitting the outcomes

Let's split the outcomes into separate columns using `.reshape = 'split'`.  Note that the result now contains two columns, one for each draw.

In [41]:
replicate(10, sample(urn, 2, replace = FALSE), .reshape = 'split')

.trial,.outcome.1,.outcome.2
<dbl>,<chr>,<chr>
1,W,B
2,W,B
3,W,B
4,W,W
5,B,B
6,B,W
7,B,B
8,W,B
9,W,B
10,B,W


#### Task 1. Estimate the distribution of the second draw from the urn.

Splitting the outcomes can be paired with `tabulate` to estimate the marginal distribution for any one draw.

In [42]:
N <- 10 # small to prototype
(replicate(N, sample(urn, 2, replace = FALSE), .reshape = 'split')
  %>% tabulate(.outcome.2)
)

X = B,X = W
<dbl>,<dbl>
0.8,0.2


In [43]:
N <- 10000
(replicate(N, sample(urn, 2, replace = FALSE), .reshape = 'split')
 %>% tabulate(.outcome.2)
)


#### Task 2. Estimate both the chance that the draws are the same color and the chance the second draw is blue.
Note that both questions are related to the position/order of the draws, which suggests we should split the outcomes.  Since we are asked to answer two questions, we will
1. Create a Boolean column for the event in each, and
2. Use `estimate_all_prob` to estimate the probabilities for all Boolean columns.




**Step 1 - Use `mutate` to create a Boolean column for each event.**

In [47]:
N <- 10 # small to prototype
(replicate(N, sample(urn, 2, replace = FALSE), .reshape = 'split')
  %>% mutate(same.color = .outcome.1 == .outcome.2,
             second.blue = .outcome.2 == 'B',
            )
)

.trial,.outcome.1,.outcome.2,same.color,second.blue
<dbl>,<chr>,<chr>,<lgl>,<lgl>
1,B,B,True,True
2,W,B,False,True
3,B,B,True,True
4,B,W,False,False
5,B,B,True,True
6,B,B,True,True
7,W,B,False,True
8,W,W,True,False
9,W,B,False,True
10,W,B,False,True


**Step 2 - Use `estimate_all_prob`**

In [55]:
N <- 100
(replicate(N, sample(urn, 2, replace = FALSE), .reshape = 'split')
  %>% mutate(same.color = .outcome.1 == .outcome.2,
             second.blue = .outcome.2 == 'B',
            )
  %>% estimate_all_prob
)

same.color,second.blue
<dbl>,<dbl>
0.51,0.64


In [54]:
N <- 100000
(replicate(N, sample(urn, 2, replace = FALSE), .reshape = 'split')
  %>% mutate(same.color = .outcome.1 == .outcome.2,
             second.blue = .outcome.2 == 'B',
            )
  %>% estimate_all_prob
)

same.color,second.blue
<dbl>,<dbl>
0.46472,0.62327
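
Both estimates can be compared with exact values. The two draws share a color when both are blue or both are white, and the chance that the second draw is blue follows from conditioning on the first draw. Here is a base R sketch. The names `p_same` and `p_second_blue` are introduced here for illustration.

```r
# Same color: both blue or both white, out of all ways to choose 2 of the 8 chips
p_same <- (choose(5, 2) + choose(3, 2)) / choose(8, 2)

# Second draw blue, by conditioning on the first draw:
# P(B first) * P(B second | B first) + P(W first) * P(B second | W first)
p_second_blue <- (5 / 8) * (4 / 7) + (3 / 8) * (5 / 7)

c(same.color = p_same, second.blue = p_second_blue)
```

The exact values are 13/28 (about 0.4643) and 5/8 = 0.625, in line with the large-`N` estimates above.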


### Trick 3. Stacking outcomes to tabulate all the marginal distributions together.

The other option for reshaping the trials is to stack the outcomes using `.reshape = 'stack'`.  To see this in action, let's stack the results of drawing four chips from our previous urn.

In [56]:
urn

In [57]:
sample(urn, 4, replace = FALSE)

In [61]:
replicate(5, sample(urn, 4, replace = FALSE))

.trial,.outcome
<dbl>,<list>
1,"W, B, W, B"
2,"B, B, W, B"
3,"B, B, W, B"
4,"W, B, B, B"
5,"W, W, B, B"


We can stack the outcomes using `.reshape = 'stack'`.  This adds a new column, titled `.replication` by default, that records the position of each draw within its trial.

In [62]:
replicate(5, sample(urn, 4, replace = FALSE), .reshape = 'stack')

.trial,.replication,.outcome
<dbl>,<int>,<chr>
1,1,W
1,2,B
1,3,W
1,4,B
2,1,B
2,2,W
2,3,B
2,4,B
3,1,B
3,2,B


#### Task - Estimate the marginal distributions for drawing four chips from our urn.

To estimate the distribution for all the draws simultaneously, we will
1. Stack the outcomes,
2. `group_by` the replications, and
3. `tabulate` the outcomes.

The result is a two-way table of marginal probabilities.

In [67]:
# prototype
N <- 100
(replicate(N, sample(urn, 4, replace = FALSE), .reshape = 'stack')
 %>% group_by(.replication)
 %>% tabulate(.outcome)
)

Adding missing grouping variables: `.replication`


.replication,X = B,X = W
<int>,<dbl>,<dbl>
1,0.66,0.34
2,0.61,0.39
3,0.66,0.34
4,0.61,0.39


In [72]:
# Get good estimates
N <- 100000
(replicate(N, sample(urn, 4, replace = FALSE), .reshape = 'stack')
 %>% group_by(.replication)
 %>% tabulate(.outcome)
)

Adding missing grouping variables: `.replication`


.replication,X = B,X = W
<int>,<dbl>,<dbl>
1,0.6283,0.3717
2,0.62311,0.37689
3,0.62436,0.37564
4,0.62224,0.37776
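
All four rows are close to 5/8 = 0.625 for blue. This is expected: when the draw order is random, every position is equally likely to hold any particular chip, so each draw has the same marginal distribution. The sketch below repeats the check with base R only. It calls `base::replicate` explicitly because `purrrfect` masks `replicate`, and the seed and the name `draws` are illustrative.

```r
set.seed(7)  # arbitrary seed, used only for reproducibility
urn <- c(rep('B', 5), rep('W', 3))
N <- 20000

# base::replicate returns a 4 x N character matrix; transpose so each row is a trial
draws <- t(base::replicate(N, sample(urn, 4, replace = FALSE)))

# Proportion of blue chips at each draw position (one value per column)
colMeans(draws == 'B')
```

Each of the four proportions should land near 0.625, mirroring the `X = B` column in the table above.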
