<a href="https://colab.research.google.com/github/yardsale8/probability_simulations_in_R/blob/main/1_2_estimating_conditional_probabilites.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
library(dplyr)
library(tidyr)
library(purrr)
library(devtools)
install_github('yardsale8/purrrfect', force = TRUE)
library(purrrfect)

# Estimating conditional probabilities

In this lecture, we will demonstrate how to
1. Construct events and estimate the conditional probability of one event given another.
2. Use split, group, and tabulate to estimate the distribution of one variable conditioned on another.

In [None]:
## Estimating conditional distributions of one variable given another.

First, we will look at how `group_by` combined with `tabulate` can be used to calculate conditional distributions by solving the following example.

## <font color="red"> Exercise 1.2.1 </font>
Consider the experiment of rolling a 6-sided die twice.  Let $X$ represent the sum of the rolls and $Y$ represent the maximum of the two outcomes.  Estimate $P(X\le 4 | Y = 4)$

In [2]:
# Your code here

## Technique 1 - Estimating Conditional Probability for Events

First, suppose that we have defined two events $A$ and $B$ and we want to estimate the conditional probability $P(A|B)$.  We can perform this task by
1. Using some combination of split and `mutate` to compute a Boolean column for each event.
2. Applying either `filter` or `group_by` to the given event.
3. Using `estimate_prob` on the conditional event of interest.
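These steps mirror the definition $P(A|B) = P(A \cap B)/P(B)$. The following is a minimal base-R sketch of that idea, not the `purrrfect` workflow used in this lecture. It assumes a hypothetical experiment (one roll of a 6-sided die), and the names `rolls`, `A`, and `B` are illustrative only.

```r
set.seed(1)                                   # for reproducibility
rolls <- sample(1:6, 10000, replace = TRUE)   # one roll per trial

A <- rolls %% 2 == 0    # event A: the roll is even
B <- rolls > 3          # event B: the roll is larger than 3

# Keep only the trials where the given event B occurred,
# then take the proportion of those trials where A also occurred.
p_filter <- mean(A[B])

# The same estimate written as the ratio from the definition.
p_ratio <- sum(A & B) / sum(B)

c(filter = p_filter, ratio = p_ratio)   # both should be close to 2/3
```

Both numbers agree because filtering on $B$ and then averaging $A$ is the same count ratio as $\#(A \cap B) / \#B$.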

Let's illustrate with a problem involving rolling a die.

### Example 1.  Rolling a 20-sided die.

Suppose we roll a fair 20-sided die three times and define
* $A$ as the event that we roll at least one number 18 or larger, and
* $B$ as the event that the average of the three rolls is larger than 10.

**Task.** Estimate $P(B|A)$


#### Step 1. Set up the simulation

In [27]:
(die <- 1:20)

In [29]:
sample(die, 3 , replace = TRUE)

In [30]:
replicate(10, sample(die, 3, replace = TRUE))

.trial,.outcome
<dbl>,<list>
1,"18, 6, 14"
2,"3, 18, 3"
3,"5, 3, 18"
4,"8, 8, 20"
5,"14, 19, 14"
6,"10, 13, 1"
7,"2, 20, 3"
8,"18, 7, 20"
9,"12, 7, 6"
10,"14, 4, 10"


#### Step 2a.  Create a column for event $A$

Note that this task is easier if we first count the number of rolls that are 18 or larger.  This is a compound event, so we will pass a helper function to `col_num_successes`.  Finally, we can use `mutate` to create the event column for $A$.

In [33]:
N <- 10
(replicate(N, sample(die, 3, replace = TRUE))
 %>% col_num_successes(.outcome, \(x) x >= 18)
 %>% mutate(A = .successes >= 1)
)


.trial,.outcome,.successes,A
<dbl>,<list>,<int>,<lgl>
1,"19, 13, 14",1,True
2,"9, 20, 14",1,True
3,"1, 12, 2",0,False
4,"3, 20, 6",1,True
5,"8, 13, 18",1,True
6,"13, 14, 17",0,False
7,"12, 1, 4",0,False
8,"6, 4, 15",0,False
9,"5, 7, 12",0,False
10,"13, 2, 3",0,False
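As a rough base-R sketch of what this step computes, the code below counts the rolls of 18 or larger in each trial and then builds the event column. It assumes, based only on the output above, that `col_num_successes` counts the elements of each outcome that satisfy the helper function; the names `outcomes` and `successes` are illustrative.

```r
# The first four outcomes from the table above, stored as a plain list.
outcomes <- list(c(19, 13, 14), c(9, 20, 14), c(1, 12, 2), c(3, 20, 6))

# Count the rolls in each trial that are 18 or larger.
successes <- vapply(outcomes, function(x) sum(x >= 18), integer(1))

# Event A: at least one roll is 18 or larger.
A <- successes >= 1

data.frame(successes, A)
```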


#### Step 2b.  Construct a Boolean column for event $B$

Similar to the last step, it is beneficial to first compute the average, then construct the Boolean column.

Normally, applying `mean` in `mutate` will compute the average of the whole column, but we want the average of each element.  This is accomplished by mapping the `mean` function onto each individual list, as follows.  Since we will be expecting floating point output, we will use `map_dbl` to specify the output type.


In [37]:
N <- 10
(replicate(N, sample(die, 3, replace = TRUE))
 %>% col_num_successes(.outcome, \(x) x >= 18)
 %>% mutate(A = .successes >= 1)
 %>% mutate(avg.roll = map_dbl(.outcome, \(x) mean(x)))
)

.trial,.outcome,.successes,A,avg.roll
<dbl>,<list>,<int>,<lgl>,<dbl>
1,"4, 14, 13",0,False,10.333333
2,"13, 14, 7",0,False,11.333333
3,"2, 1, 7",0,False,3.333333
4,"9, 18, 18",2,True,15.0
5,"17, 12, 20",1,True,16.333333
6,"5, 20, 5",1,True,10.0
7,"4, 1, 13",0,False,6.0
8,"3, 10, 3",0,False,5.333333
9,"18, 1, 13",1,True,10.666667
10,"7, 11, 10",0,False,9.333333
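To see the difference between averaging everything and averaging each trial, here is a small sketch on a plain list holding three outcomes copied from the table above. The names `outcomes`, `overall`, and `avg_roll` are illustrative, and `vapply` is shown only as a base-R comparison.

```r
library(purrr)

outcomes <- list(c(4, 14, 13), c(9, 18, 18), c(5, 20, 5))

# Averaging all of the values together collapses them into a single number ...
overall <- mean(unlist(outcomes))

# ... whereas map_dbl() applies mean() to each trial and returns a double vector.
avg_roll <- map_dbl(outcomes, mean)

# Base-R equivalent, shown for comparison.
avg_roll_base <- vapply(outcomes, mean, numeric(1))

avg_roll
```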


Now we can compute the Boolean column for $B$ by checking whether the average is larger than 10.

In [38]:
N <- 10
(replicate(N, sample(die, 3, replace = TRUE))
 %>% col_num_successes(.outcome, \(x) x >= 18)
 %>% mutate(A = .successes >= 1)
 %>% mutate(avg.roll = map_dbl(.outcome, \(x) mean(x)))
 %>% mutate(B = avg.roll > 10)
)

.trial,.outcome,.successes,A,avg.roll,B
<dbl>,<list>,<int>,<lgl>,<dbl>,<lgl>
1,"15, 11, 14",0,False,13.333333,True
2,"14, 14, 13",0,False,13.666667,True
3,"3, 9, 17",0,False,9.666667,False
4,"2, 7, 2",0,False,3.666667,False
5,"9, 16, 19",1,True,14.666667,True
6,"18, 9, 8",1,True,11.666667,True
7,"3, 2, 15",0,False,6.666667,False
8,"20, 7, 20",2,True,15.666667,True
9,"7, 14, 8",0,False,9.666667,False
10,"3, 6, 3",0,False,4.0,False


#### Step 3.  Either `filter` or `group_by` the given event, then estimate the conditional event using `estimate_prob`

In [39]:
# Option 1 - Filter by given event
N <- 10
(replicate(N, sample(die, 3, replace = TRUE))
 %>% col_num_successes(.outcome, \(x) x >= 18)
 %>% mutate(A = .successes >= 1)
 %>% mutate(avg.roll = map_dbl(.outcome, \(x) mean(x)))
 %>% mutate(B = avg.roll > 10)
 %>% filter(A)
 %>% estimate_prob(B)
)

B
<dbl>
0.75


In [40]:
# Option 2 - Group by given event
N <- 10
(replicate(N, sample(die, 3, replace = TRUE))
 %>% col_num_successes(.outcome, \(x) x >= 18)
 %>% mutate(A = .successes >= 1)
 %>% mutate(avg.roll = map_dbl(.outcome, \(x) mean(x)))
 %>% mutate(B = avg.roll > 10)
 %>% group_by(A)
 %>% estimate_prob(B)
)

A,B
<lgl>,<dbl>
False,0.4285714
True,0.6666667
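The two options answer slightly different questions: filtering reports only $P(B|A)$, while grouping reports one estimate per value of $A$. The base-R sketch below illustrates this on the event columns from the 10-trial table in Step 2b. It assumes that `estimate_prob` returns the proportion of `TRUE` values, which is consistent with the outputs shown but is not taken from the package documentation.

```r
# Event columns copied from the 10-trial table in Step 2b.
A <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
B <- c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Option 1 analogue: keep only the trials where A occurred.
p_B_given_A <- mean(B[A])

# Option 2 analogue: one proportion of B per value of A.
p_B_by_A <- tapply(B, A, mean)

p_B_given_A
p_B_by_A
```

The `TRUE` entry of the grouped result matches the filtered estimate, and the `FALSE` entry is the estimate of $P(B|A^c)$.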


#### Step 4. Increase the number of trials to get a good estimate.

In [42]:

# Option 1 - Filter by given event
N <- 10000
(replicate(N, sample(die, 3, replace = TRUE))
 %>% col_num_successes(.outcome, \(x) x >= 18)
 %>% mutate(A = .successes >= 1)
 %>% mutate(avg.roll = map_dbl(.outcome, \(x) mean(x)))
 %>% mutate(B = avg.roll > 10)
 %>% filter(A)
 %>% estimate_prob(B)
)

B
<dbl>
0.8433393
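Because the sample space is small, the simulated value can be checked against an exact answer. The sketch below is an optional cross-check in base R, not part of the simulation workflow: it enumerates all $20^3$ equally likely outcomes and counts directly. The names `grid` and `p_exact` are illustrative.

```r
# Enumerate all 20^3 equally likely ordered outcomes of three rolls.
grid <- expand.grid(r1 = 1:20, r2 = 1:20, r3 = 1:20)

A <- pmax(grid$r1, grid$r2, grid$r3) >= 18      # at least one roll is 18 or larger
B <- (grid$r1 + grid$r2 + grid$r3) / 3 > 10     # average of the three rolls exceeds 10

# Exact conditional probability P(B | A) as a ratio of counts.
p_exact <- sum(A & B) / sum(A)
p_exact   # 2589 / 3087, about 0.839
```

The estimate from 10,000 trials should land close to this value, with the gap shrinking as $N$ grows.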


## Technique 2 - Group and tabulate to estimate whole conditional distributions

Suppose that we want to estimate the conditional distribution of some random variable $X$ given another random variable $Y$.  This can be accomplished by

1. Creating columns for $X$ and $Y$ using some combination of splitting outcomes and `mutate`.
2. Grouping on the given variable $Y$ using `group_by(Y)`
3. Using `tabulate(X)` to get a summary table of conditional distributions with one row/distribution per value of $Y$.
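The same group-then-tabulate idea can be sketched in base R with `table` and `prop.table`. This is only an analogue of the `group_by` and `tabulate` pipeline, using a hypothetical experiment (three flips of a fair coin) with $Y$ as the first flip and $X$ as the number of heads. `base::replicate` is called explicitly because `purrrfect` provides its own `replicate`.

```r
set.seed(2)

# Each column of `flips` is one trial of three coin flips.
flips <- base::replicate(20000, sample(c("H", "T"), 3, replace = TRUE))

Y <- flips[1, ]              # given variable: result of the first flip
X <- colSums(flips == "H")   # variable of interest: number of heads

counts <- table(Y, X)                         # joint counts, one row per value of Y
cond_dist <- prop.table(counts, margin = 1)   # divide each row by its row total
round(cond_dist, 3)
```

Each row of `cond_dist` sums to 1, so each row is an estimated conditional distribution of $X$ for one value of $Y$.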

We will illustrate with an urn problem.

### Example 1 - Drawing chips from an urn

Suppose we have an urn containing 5 blue and 2 white chips, and we draw three chips at random without replacement.

**Task.** Estimate the distribution of the third draw given the color of the first draw.

#### Step 1. Set up the simulation for a small number of trials

In [12]:
# Sample space
urn <- c(rep('B', 5), rep('W', 2))
urn

In [13]:
# Generate one trial
sample(urn, 3, replace = FALSE)

In [14]:
# Generate 10 trials and split
replicate(10, sample(urn, 3, replace = FALSE), .reshape = 'split')

.trial,.outcome.1,.outcome.2,.outcome.3
<dbl>,<chr>,<chr>,<chr>
1,W,B,B
2,B,B,W
3,B,B,W
4,B,B,B
5,B,B,W
6,B,B,B
7,B,B,W
8,B,B,B
9,B,W,B
10,B,B,B


Note that the task involves the order/location of the draws, which is why we split the outcomes.

#### Step 2. Use `group_by` and `tabulate` to estimate the conditional distribution.

**Task.** Estimate the distribution of the third draw given the color of the first draw.

Here we will `group_by` the first outcome/color (given) and `tabulate` the third outcome/color (conditional variable of interest).

In [15]:
# prototype with a small N
N <- 100
(replicate(N, sample(urn, 3, replace = FALSE), .reshape = 'split')
 %>% group_by(.outcome.1)
 %>% tabulate(.outcome.3)
)

[1m[22mAdding missing grouping variables: `.outcome.1`


.outcome.1,X = B,X = W
<chr>,<dbl>,<dbl>
B,0.6478873,0.3521127
W,0.8965517,0.1034483


In [17]:
# Good estimates with a large N
N <- 100000
(replicate(N, sample(urn, 3, replace = FALSE), .reshape = 'split')
 %>% group_by(.outcome.1)
 %>% tabulate(.outcome.3)
)


[1m[22mAdding missing grouping variables: `.outcome.1`


.outcome.1,X = B,X = W
<chr>,<dbl>,<dbl>
B,0.6659521,0.3340479
W,0.8324136,0.1675864


### Understanding the output

Note that the given/grouped variable has one entry per row, and the conditional variable has one column per value.  This means that **each row represents the conditional distribution for the given value of that row.**

Consider the following results.

| .outcome.1 | X = B     | X = W     |
|------------|-----------|-----------|
| B          | 0.6659521 | 0.3340479 |
| W          | 0.8324136 | 0.1675864 |

We see that, for example, $P(X = W | Y = W)\approx 0.1676$, where $X$ is the color of the third draw and $Y$ is the color of the first draw.
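For this urn the exact conditional distributions are easy to work out, which gives a check on the simulated table. The sketch below relies on a symmetry argument: once the first chip is removed, the third draw is equally likely to be any of the 6 remaining chips. The names `remaining_blue` and `exact` are illustrative.

```r
# After the first chip is removed, 6 chips remain; by symmetry the third draw
# is equally likely to be any one of them.
remaining_blue <- c(B = 4, W = 5)   # blue chips left, indexed by first-draw color

exact <- cbind(`X = B` = remaining_blue / 6,
               `X = W` = 1 - remaining_blue / 6)
round(exact, 4)
```

The rows are $(4/6, 2/6)$ when the first draw is blue and $(5/6, 1/6)$ when it is white, which agrees with the estimates from 100,000 trials to about two decimal places.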



## <font color="red"> Exercise 1.2.2 </font>
Consider the experiment of rolling a 6-sided die twice.  Let $X$ represent the sum of the rolls and $Y$ represent the maximum of the two outcomes.  Estimate the conditional distributions of $X$ for each given value of $Y$.

In [1]:
# Your code here

## Aside - Debugging a pipe

You probably noticed that I organize my pipes by
1. wrapping the pipe in parentheses, and
2. putting the `%>%` operator at the start of each line.

While this is not the standard style for pipes in `R`, I believe this approach makes them easier to debug.  
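A small sketch of why this layout helps, using a plain numeric vector and `magrittr` rather than the simulation pipes from this lecture (the names `x` and `res` are illustrative). Because each step starts with `%>%` and the whole pipe sits inside parentheses, any trailing step can be commented out without leaving a dangling operator.

```r
library(magrittr)

x <- c(4, 9, 16)

# Leading-pipe style: the last step is commented out and the pipe still parses.
res <- (x
  %>% sqrt()
#  %>% sum()
)
res
```

With the pipe operator at the end of each line instead, commenting out the last step would leave the previous line ending in `%>%`, so that line would need editing as well.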

### Two techniques for debugging a pipe

I use two main approaches when debugging a pipe.
1. Commenting out later steps to inspect previous results.
2. Using `walk(str)` to take a peek at intermediate values.

Let's illustrate on our previous pipe.

#### Commenting out later steps.

This approach involves commenting out all but the first step, then repeating the following steps until we have correct code.
1. Inspect the current output and debug as needed
2. Uncomment the next line and repeat.

Note that most platforms allow you to use `COMMAND/CONTROL + /` to comment the current line or selection.

Let's illustrate the process.

#### 1. Verify the first line

Be sure to reduce the number of trials to make the output manageable.

In [21]:
N <- 5
(replicate(N, sample(urn, 3, replace = FALSE), .reshape = 'split')
#  %>% group_by(.outcome.1)
#  %>% tabulate(.outcome.3)
)


.trial,.outcome.1,.outcome.2,.outcome.3
<dbl>,<chr>,<chr>,<chr>
1,B,B,B
2,W,B,B
3,B,B,W
4,B,B,W
5,B,B,W


#### 2. Uncomment and verify the second line

In [23]:
N <- 5
(replicate(N, sample(urn, 3, replace = FALSE), .reshape = 'split')
 %>% group_by(.outcome.1)
#  %>% tabulate(.outcome.3)
)

.trial,.outcome.1,.outcome.2,.outcome.3
<dbl>,<chr>,<chr>,<chr>
1,B,W,B
2,B,B,B
3,B,W,B
4,B,B,B
5,B,B,B


#### 3. Uncomment and verify the last line

Once you are confident that your code is correct, you can move to a large number of trials.

In [24]:
N <- 5
(replicate(N, sample(urn, 3, replace = FALSE), .reshape = 'split')
 %>% group_by(.outcome.1)
 %>% tabulate(.outcome.3)
)

[1m[22mAdding missing grouping variables: `.outcome.1`


.outcome.1,X = B,X = W
<chr>,<dbl>,<dbl>
B,0.5,0.5
W,1.0,0.0


### Technique 2 - Inspecting intermediate results using `walk(str)`

Sometimes, especially when aggregating, it is nice to be able to peek at an intermediate result.  The easiest way to do this is by inserting a temporary `walk(str)` into our pipe.

Let's run the last pipe again, but peek at the table before the group and tabulate steps.

In [25]:

(replicate(N, sample(urn, 3, replace = FALSE), .reshape = 'split')
 %>% walk(str)
 %>% group_by(.outcome.1)
 %>% tabulate(.outcome.3)
)

 num [1:5] 1 2 3 4 5
 chr [1:5] "W" "B" "B" "B" "B"
 chr [1:5] "B" "B" "B" "W" "B"
 chr [1:5] "B" "B" "W" "B" "W"


[1m[22mAdding missing grouping variables: `.outcome.1`


.outcome.1,X = B,X = W
<chr>,<dbl>,<dbl>
B,0.5,0.5
W,1.0,0.0


#### Understanding the output

When using `walk(str)`, the structure of the intermediate data gets printed first, then we see the result of the entire computation.  Note that `walk` does not change the data flowing through the pipeline; it performs a side effect, in this case printing the structure of each column, and then passes its input along unchanged.
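This pass-through behavior can be checked directly on a small, hypothetical data frame (the names `df` and `out` are illustrative):

```r
library(purrr)

df <- data.frame(trial = 1:3, color = c("B", "W", "B"))

# walk() calls str() on each column for its printed side effect ...
out <- walk(df, str)

# ... and returns its input unchanged, so the pipe can continue as before.
identical(out, df)
```

Since `walk` iterates over the columns of a data frame, the printed output has one line per column, which matches the four lines seen above for `.trial` and the three outcome columns.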
