<a href="https://colab.research.google.com/github/yardsale8/probability_simulations_in_R/blob/main/2_1_simulating_the_binomial_distribution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
library(dplyr)
library(tidyr)
library(purrr)
library(devtools)
install_github('yardsale8/purrrfect', force = TRUE)
library(purrrfect)

# Simulating the Binomial Distribution

In this notebook, we will apply what we've learned to problems related to the binomial distribution.  We will do this by

1. Simulating the raw outcomes using `sample` and converting to the number of successes, as well as
2. Simulating the number of successes directly using `rbinom`.

Finally, we will illustrate how to use simulations to estimate the expected value.

### Review - Bernoulli Process

A sequence of outcomes is generated by a [Bernoulli process](https://en.wikipedia.org/wiki/Bernoulli_process), provided
1. Outcomes are independent,
2. There are two possible outcomes (denoted success and failure), and
3. The probability of a success is constant.

### The binomial distribution

Suppose that we are generating $n$ outcomes from a Bernoulli process with the probability of success given by $p$.  If $X$ is the number of successes in the $n$ trials, then $X$ will have a [binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution).
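
As a quick check of this definition, the binomial probability mass function is $P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$ for $x = 0, 1, \ldots, n$. The base R sketch below evaluates the formula and compares it to `dbinom`. The names `pmf.by.hand` and `pmf.builtin` are illustrative and not part of the original notebook.

```r
# Illustrative values; any whole number n and any 0 <= p <= 1 would work.
n <- 10
p <- 0.735
x <- 0:n

# Binomial PMF written out directly from the formula.
pmf.by.hand <- choose(n, x) * p^x * (1 - p)^(n - x)

# The same probabilities from R's built-in binomial PMF.
pmf.builtin <- dbinom(x, size = n, prob = p)

# The two should agree up to floating-point error, and the PMF should sum to 1.
all.equal(pmf.by.hand, pmf.builtin)
sum(pmf.by.hand)
```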

## Strategies for simulating the binomial distribution

1. Simulate the raw Bernoulli trials, then transform into the number of successes using either reshaping or `mutate` + `map`.
2. Simulate the number of successes directly using `rbinom`.

### Example - LeBron shoots free throws

LeBron James is an NBA player with a storied career.  Over the course of his career, he has made 73.5% of all free throw attempts.  Suppose that we want to model the number of shots he makes in 10 attempts using the binomial distribution.

We will illustrate this process in three ways:

1. Simulating individual shots and reshaping to compute number of successes,
2. Simulating individual shots and using `mutate` + `map` to compute number of successes, and
3. Simulating the number of successes directly using `rbinom`.

#### Setting up a sample space

We need to sample from a space with a 73.5% chance of a made free throw, which we accomplish by passing a probability vector to `sample`:

In [3]:
shot <- c('✓', '𐄂')
shot.probs <- c(0.735, 1 - 0.735)
replicate(10, sample(shot, 10, replace = TRUE, prob = shot.probs))

.trial,.outcome
<dbl>,<list>
1,"✓, ✓, ✓, ✓, ✓, ✓, ✓, ✓, ✓, ✓"
2,"✓, ✓, 𐄂, ✓, ✓, ✓, ✓, 𐄂, 𐄂, ✓"
3,"𐄂, 𐄂, ✓, ✓, ✓, ✓, 𐄂, ✓, ✓, ✓"
4,"✓, ✓, ✓, 𐄂, 𐄂, ✓, ✓, 𐄂, ✓, ✓"
5,"✓, ✓, 𐄂, ✓, ✓, ✓, ✓, ✓, ✓, 𐄂"
6,"𐄂, 𐄂, ✓, 𐄂, ✓, ✓, ✓, ✓, ✓, ✓"
7,"✓, ✓, 𐄂, ✓, 𐄂, ✓, ✓, ✓, ✓, ✓"
8,"𐄂, ✓, ✓, ✓, 𐄂, ✓, ✓, ✓, 𐄂, ✓"
9,"✓, ✓, ✓, ✓, ✓, ✓, 𐄂, ✓, 𐄂, ✓"
10,"𐄂, ✓, ✓, ✓, ✓, ✓, ✓, ✓, ✓, ✓"
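
One way to sanity-check the probability vector is to draw a large number of individual shots and confirm that the fraction of makes is close to 0.735. The sketch below uses only base R; the seed and the names `many.shots` and `prop.made` are illustrative assumptions, not part of the original notebook.

```r
set.seed(2024)  # arbitrary seed, used only to make the run reproducible

shot <- c('✓', '𐄂')
shot.probs <- c(0.735, 1 - 0.735)

# Draw many individual shots using the probability vector.
many.shots <- sample(shot, 100000, replace = TRUE, prob = shot.probs)

# Fraction of shots that were made; this should be close to 0.735.
prop.made <- mean(many.shots == '✓')
prop.made
```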


#### Approach 1 - Simulate individual shots, reshape, and compute successes.

Comment out lines to explore the output in each step.

In [4]:
N <- 10
(replicate(N, sample(shot, 10, replace = TRUE, prob = shot.probs), .reshape = 'stack')
 %>% mutate(is.success = ifelse(.outcome == '✓', 1, 0))
 %>% group_by(.trial) %>% summarise(num.successes = sum(is.success))
 )

.trial,num.successes
<dbl>,<dbl>
1,8
2,6
3,6
4,4
5,6
6,9
7,8
8,7
9,10
10,7


#### Approach 2 - Simulate individual shots, recode outcomes, and compute successes using `mutate` and `map`

Comment out lines to explore the output in each step.

In [5]:
N <- 10
(replicate(N, sample(shot, 10, replace = TRUE, prob = shot.probs))
 %>% mutate(is.success = map(.outcome, \(x) ifelse(x == '✓', 1, 0)))
 %>% mutate(num.successes = map_int(is.success, sum))
 )

.trial,.outcome,is.success,num.successes
<dbl>,<list>,<list>,<int>
1,"✓, ✓, ✓, ✓, 𐄂, ✓, ✓, ✓, ✓, ✓","1, 1, 1, 1, 0, 1, 1, 1, 1, 1",9
2,"✓, ✓, ✓, 𐄂, ✓, ✓, ✓, ✓, ✓, 𐄂","1, 1, 1, 0, 1, 1, 1, 1, 1, 0",8
3,"✓, 𐄂, 𐄂, ✓, ✓, ✓, ✓, ✓, ✓, 𐄂","1, 0, 0, 1, 1, 1, 1, 1, 1, 0",7
4,"𐄂, ✓, 𐄂, 𐄂, ✓, ✓, 𐄂, ✓, ✓, ✓","0, 1, 0, 0, 1, 1, 0, 1, 1, 1",6
5,"✓, 𐄂, ✓, ✓, 𐄂, 𐄂, ✓, ✓, ✓, ✓","1, 0, 1, 1, 0, 0, 1, 1, 1, 1",7
6,"𐄂, ✓, 𐄂, ✓, ✓, ✓, ✓, 𐄂, ✓, ✓","0, 1, 0, 1, 1, 1, 1, 0, 1, 1",7
7,"✓, ✓, ✓, ✓, ✓, 𐄂, ✓, ✓, ✓, ✓","1, 1, 1, 1, 1, 0, 1, 1, 1, 1",9
8,"✓, 𐄂, ✓, ✓, ✓, ✓, ✓, 𐄂, 𐄂, ✓","1, 0, 1, 1, 1, 1, 1, 0, 0, 1",7
9,"✓, ✓, 𐄂, ✓, ✓, ✓, ✓, 𐄂, ✓, ✓","1, 1, 0, 1, 1, 1, 1, 0, 1, 1",8
10,"✓, 𐄂, 𐄂, ✓, 𐄂, 𐄂, ✓, ✓, ✓, ✓","1, 0, 0, 1, 0, 0, 1, 1, 1, 1",6
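
For readers who want to see the counting step without the data-frame helpers, here is a minimal base R sketch of the same idea: simulate 10 raw shots, count the makes, and repeat. It is not the notebook's `replicate`-based approach; `num.made` and the seed are illustrative choices.

```r
set.seed(2024)  # arbitrary seed, used only to make the run reproducible

shot <- c('✓', '𐄂')
shot.probs <- c(0.735, 1 - 0.735)
N <- 20000

# For each of N experiments, simulate 10 shots and count the number made.
num.made <- vapply(
  seq_len(N),
  function(i) sum(sample(shot, 10, replace = TRUE, prob = shot.probs) == '✓'),
  integer(1)
)

# The average count should be close to the binomial mean n * p = 7.35.
mean(num.made)
```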


#### Approach 3 - Simulate the number of successes directly using `rbinom`

Note that the signature of `rbinom` is `rbinom(n, size, prob)` (use help!).  Here we will use `n = 1` to mean one experiment per row, `size = 10` to represent the attempts per trial/experiment, and `prob = 0.735` for the probability of success on each attempt.

**Important.** Be sure to use `replicate_int` to get a simple integer `.outcome` column.

In [6]:
N <- 10
(replicate_int(N, rbinom(1, 10, 0.735))
)

.trial,.outcome
<dbl>,<int>
1,9
2,5
3,7
4,7
5,7
6,9
7,5
8,9
9,8
10,7
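
To make the roles of the three `rbinom` arguments concrete, the base R sketch below calls it outside of `replicate_int` with different values of `n` and `size`. The variable names and seed are illustrative only.

```r
set.seed(2024)  # arbitrary seed, used only to make the run reproducible

# One experiment of 10 attempts: a single count between 0 and 10.
one.experiment <- rbinom(n = 1, size = 10, prob = 0.735)

# Five experiments of 10 attempts each: a vector of five counts.
five.experiments <- rbinom(n = 5, size = 10, prob = 0.735)

# Ten experiments of one attempt each: ten 0/1 Bernoulli outcomes.
ten.bernoulli <- rbinom(n = 10, size = 1, prob = 0.735)

one.experiment
five.experiments
ten.bernoulli
```

In other words, `n` controls how many values come back, while `size` controls how many attempts are counted inside each value.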


## Three Types of Simulation Tasks

When performing a simulation, we will generally be trying to

1. Estimate the probability of an event,
2. Estimate the cut off for some region of the distribution (i.e., inverse probability), or
3. Estimate all the values for the probability mass function.

In this section, we will illustrate how to perform each task after computing the number of successes.

### Estimating Probabilities

Once we have a column containing the number of successes, we can answer probability questions as usual: make a Boolean column, then use `estimate_prob` or `estimate_all_prob`.  Finally, we can compare the estimate to the exact value using, for example, `sum` and `dbinom`.

**Example Question.** What is the probability that LeBron will make 9 or more of the 10 free throw attempts?

In [10]:
N <- 100000
(replicate_int(N, rbinom(1, 10, 0.735))
 %>% mutate(nine.or.more = .outcome >= 9)
 %>% estimate_all_prob
 %>% mutate(exact.prob = sum(dbinom(9:10, 10, 0.735)))
)


nine.or.more,exact.prob
<dbl>,<dbl>
0.213,0.2119067
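
The exact value can be computed in more than one way. The base R sketch below estimates the same probability with a plain vector of simulated counts and compares it to the exact answer from both the PMF (`dbinom`) and the upper tail of the CDF (`pbinom` with `lower.tail = FALSE`). The variable names and seed are illustrative.

```r
set.seed(2024)  # arbitrary seed, used only to make the run reproducible

N <- 100000
made <- rbinom(N, size = 10, prob = 0.735)

# Simulation estimate of P(X >= 9): the proportion of runs with 9 or 10 makes.
est.prob <- mean(made >= 9)

# Exact value by summing the PMF over the outcomes 9 and 10.
exact.by.pmf <- sum(dbinom(9:10, size = 10, prob = 0.735))

# Exact value from the CDF: P(X >= 9) = P(X > 8).
exact.by.cdf <- pbinom(8, size = 10, prob = 0.735, lower.tail = FALSE)

c(estimate = est.prob, pmf = exact.by.pmf, cdf = exact.by.cdf)
```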


## Estimating the Distribution

Now that we know how to simulate a binomial random variable, we can use `tabulate` to estimate the values of the probability mass function.

In [8]:
n <- 10
p <- 0.735
N <- 100000
(replicate_int(N, rbinom(1, n, p))
 %>% tabulate(.outcome)
)

X = 1,X = 10,X = 2,X = 3,X = 4,X = 5,X = 6,X = 7,X = 8,X = 9
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5e-05,0.04728,0.00064,0.00415,0.02146,0.0719,0.16241,0.25834,0.26629,0.16748
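
To judge how good these estimates are, we can line them up against the exact PMF from `dbinom`. The sketch below is a base R alternative that does not use the notebook's `tabulate` helper; converting the counts to a factor with explicit levels keeps outcomes that never occurred (such as $X = 0$) and orders the outcomes numerically. The names `est.pmf` and `exact.pmf` are illustrative.

```r
set.seed(2024)  # arbitrary seed, used only to make the run reproducible

n <- 10
p <- 0.735
N <- 100000
counts <- rbinom(N, size = n, prob = p)

# Relative frequency of each outcome 0, 1, ..., n (zero if it never occurred).
est.pmf <- as.numeric(table(factor(counts, levels = 0:n))) / N

# Exact binomial probabilities for the same outcomes.
exact.pmf <- dbinom(0:n, size = n, prob = p)

data.frame(x = 0:n, estimated = est.pmf, exact = round(exact.pmf, 5))
```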


## Working with Quantiles of a Discrete Distribution

<img src="https://beta.boost.org/doc/libs/1_36_0/libs/math/doc/sf_and_dist/graphs/binomial_pdf_3.png" width="600">

The **quantile** for a given value $0 \le p \le 1$ is the smallest number $x$ such that $P(X \le x) \ge p$. For a discrete distribution, this means that the right tail $P(X \ge x)$ will generally be larger than $1-p$, so we need to make an adjustment when working with the right tail.  Here is our general rule.

1. Use `quantile(data, p)` for questions about the left-tail cut off, but
2. Assuming the outcomes are whole numbers, use `quantile(data, p) + 1` for questions about the right-tail cut off (for other distributions, use the next largest feasible outcome).

More details can be found [here](https://beta.boost.org/doc/libs/1_36_0/libs/math/doc/sf_and_dist/html/math_toolkit/policy/pol_tutorial/understand_dis_quant.html) (which is also the location of the image shown above).
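
The effect of the `+ 1` adjustment can be checked against the exact distribution. The sketch below uses base R's `qbinom`, which returns the smallest $x$ with $P(X \le x) \ge p$, together with `pbinom` for the tail probabilities. It uses LeBron's $n = 10$ and $p = 0.735$; the variable names are illustrative.

```r
n <- 10
p <- 0.735

# Left tail: exact 5% quantile and the cumulative probabilities around it.
left.cut <- qbinom(0.05, size = n, prob = p)
pbinom(left.cut, n, p)       # P(X <= left.cut), at least 0.05
pbinom(left.cut - 1, n, p)   # P(X <= left.cut - 1), below 0.05

# Right tail: the 90% quantile alone leaves more than 10% at or above it.
q90 <- qbinom(0.90, size = n, prob = p)
right.at.q <- pbinom(q90 - 1, n, p, lower.tail = FALSE)          # P(X >= q90)

# Moving to the next whole number brings the right tail to 10% or less.
right.cut <- q90 + 1
right.at.cut <- pbinom(right.cut - 1, n, p, lower.tail = FALSE)  # P(X >= right.cut)

c(left.cut = left.cut, right.cut = right.cut,
  right.at.q = right.at.q, right.at.cut = right.at.cut)
```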

### Estimating the cut off for the left tail of the distribution.

We can `summarise` the number of successes using the `quantile` function to find the approximate cut off for the left tail of the distribution.

**Example Lower-Tail Task.** Find the cut off for the smallest 5% of made shots in the 10 attempts.

In [9]:
?quantile

In [5]:
n <- 10
p <- 0.735
left.tail.percent <- 0.05
N <- 100000
(replicate_int(N, rbinom(1, n, p))
 %>% summarise(x.lower.5.percent = quantile(.outcome, left.tail.percent))
)

x.lower.5.percent
<dbl>
5


In [6]:
n <- 10
p <- 0.735
N <- 100000
(replicate_int(N, rbinom(1, n, p))
 %>% tabulate(.outcome)
)

X = 1,X = 10,X = 2,X = 3,X = 4,X = 5,X = 6,X = 7,X = 8,X = 9
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5e-05,0.04565,0.00057,0.00421,0.02152,0.07133,0.16121,0.25926,0.26905,0.16715


**Example Upper-Tail Task.** Find the cut off for the largest 10% of made shots in the 10 attempts.

In [7]:
n <- 10
p <- 0.735
left.tail.percent <- 1 - 0.10
N <- 100000
(replicate_int(N, rbinom(1, n, p))
 %>% summarise(x.upper.10.percent = quantile(.outcome, left.tail.percent) + 1)
)

x.upper.10.percent
<dbl>
10


## Simulating the Expected Value

So far, we have used our simulations to estimate the answers to questions about probabilities and distributions. Simulations can also be used to estimate various expected values.  Here are three common types of expectation problems and the associated simulation strategy.

1. **Mean/Expected Value.** Simulate $X$ $\longrightarrow$ summarise using `mean`.
2. **Variance/Standard Deviation.** Simulate $X$ $\longrightarrow$ summarise using `var` or `sd` functions.
3. **Expected Value of a function.** Simulate $X$ $\longrightarrow$ use `mutate` to compute an `f(X)` column $\longrightarrow$ summarise using `mean`.

#### Example 1 - The mean, variance, and standard deviation of LeBron's number of made shots

Again, we have LeBron--a 73.5% free throw shooter--shooting 10 free throws.  Let's estimate the mean, variance, and standard deviation of the number of made shots.  I will use `rbinom` to directly simulate the binomial trials.


In [None]:
n <- 10
p <- 0.735
N <- 10000
( replicate_int(N, rbinom(1, n, p))
  %>% summarise(approx.mu = mean(.outcome),
                approx.var = var(.outcome),
                approx.sd = sd(.outcome))

)

approx.mu,approx.var,approx.sd
<dbl>,<dbl>,<dbl>
7.3398,1.991135,1.411076


#### Comparing to the theoretical values

Next, we can use `mutate` to compute the actual theoretical values and use `relocate` to put the respective values next to each other.

In [None]:
n <- 10
p <- 0.735
N <- 100000
( replicate_int(N, rbinom(1, n, p))
  %>% summarise(approx.mu = mean(.outcome),
                approx.var = var(.outcome),
                approx.sd = sd(.outcome))
  %>% mutate(actual.mu = n*p,
             actual.var = n*p*(1-p),
             actual.sd = sqrt(actual.var))
  %>% relocate(actual.mu, .after = approx.mu)
  %>% relocate(actual.var, .after = approx.var)
  %>% relocate(actual.sd, .after = approx.sd)
)



approx.mu,actual.mu,approx.var,actual.var,approx.sd,actual.sd
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
7.35553,7.35,1.946488,1.94775,1.395166,1.395618
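
The third strategy listed earlier, the expected value of a function of $X$, follows the same pattern. As a minimal base R sketch, take $f(X) = X^2$ (an illustrative choice, not from the original notebook), for which the exact value is $E(X^2) = Var(X) + E(X)^2 = np(1-p) + (np)^2$.

```r
set.seed(2024)  # arbitrary seed, used only to make the run reproducible

n <- 10
p <- 0.735
N <- 100000
made <- rbinom(N, size = n, prob = p)

# Compute f(X) = X^2 for every simulated outcome, then average.
f.of.x <- made^2
approx.mean.f <- mean(f.of.x)

# Exact value from E(X^2) = Var(X) + E(X)^2 for a binomial random variable.
actual.mean.f <- n * p * (1 - p) + (n * p)^2

c(approx = approx.mean.f, actual = actual.mean.f)
```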


### <font color="red"> Exercise 2.1.1 - Binomial probabilities </font>

Historically, circuit boards used in the manufacture of video game consoles are defective 5% of the time. Let $X$ represent the number of defective boards in a random sample of size $n=25$.  

1. Explain why we should use the binomial distribution here.
2. Find $P(X \ge 5)$ using each of the three approaches discussed above.
3. Find the cut-off for the lowest 10% of the distribution.
4. Use  `tabulate` to estimate the PMF of $X$.
5. Use a simulation to estimate $E(X)$ and $SD(X)$.



In [None]:
# Your code here