<a href="https://colab.research.google.com/github/yardsale8/probability_simulations_in_R/blob/main/2_2_simulating_the_geometric_and_negative_binomial_distributions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
library(dplyr)
library(tidyr)
library(purrr)
library(devtools)
install_github('yardsale8/purrrfect', force = TRUE)
library(purrrfect)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: usethis


Attaching package: ‘purrrfect’


The following objects are masked from ‘package:base’:

    replicate, tabulate




# Simulating the Geometric Distribution

In this notebook, we will apply what we've learned to two distributions related to the binomial distribution, namely the geometric and negative binomial distributions.  We will

1. Use the `sample_until` function to simulate wait-time problems,
2. Solve problems related to the geometric and negative binomial distributions by simulating the raw outcomes using `sample_until` and converting to the number of failures, and
3. Simulate the number of failures directly using `rgeom` and `rnbinom`.

Finally, we will again illustrate how to use simulations to estimate the expected value.

### Review - Bernoulli Process

A sequence of outcomes is generated by a [Bernoulli process](https://en.wikipedia.org/wiki/Bernoulli_process), provided
1. Outcomes are independent,
2. There are two possible outcomes (denoted success and failure), and
3. The probability of a success is constant.

### The geometric distribution

Suppose our experiment involves continually generating outcomes using a Bernoulli process until we see the first success.  There are two formulations for the [geometric distribution](https://en.wikipedia.org/wiki/Geometric_distribution), given below.

* Let $X$ be the number of failures observed before the first success, and
* Let $Y$ be the total number of trials up to and including the first success.

Both random variables can be referred to as having a geometric distribution, but with different formulae (PMF, mean, variance, etc.).

**We will use $X$ (number of failures) since that is what R uses.**
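
For reference, here are the standard formulae for the two formulations, writing $p$ for the probability of success on a single trial (the symbol $p$ is introduced here for illustration):

$$P(X = x) = (1-p)^x\,p, \quad x = 0, 1, 2, \ldots, \qquad E(X) = \frac{1-p}{p}, \qquad \mathrm{Var}(X) = \frac{1-p}{p^2}$$

$$P(Y = y) = (1-p)^{y-1}\,p, \quad y = 1, 2, 3, \ldots, \qquad E(Y) = \frac{1}{p}, \qquad \mathrm{Var}(Y) = \frac{1-p}{p^2}$$

Since $Y = X + 1$, the two means differ by exactly 1 and the variances are equal.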

## Strategies for simulating the geometric distribution

1. Simulate the raw Bernoulli trials using `sample_until`, then transform into the number of failures using either reshaping or `mutate` + `map`.
2. Simulate the number of failures directly using `rgeom`.

### Using `sample_until` to simulate waiting for a single success

The `purrrfect` library provides the `sample_until` function, which is similar to the base `sample` function, but also includes a stopping condition.  The stopping condition, passed as the `.p` argument, can be either a literal value or a predicate function.

**Signature:** `sample_until(x, .p, replace = FALSE, prob = NULL, .halt = 1000)`

In [3]:
?sample_until
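
To make the idea of "sample with a stopping condition" concrete, here is a minimal base R sketch. The helper `sample_until_sketch` and its arguments `is_done` and `halt` are hypothetical names made up for this illustration; this is not how `purrrfect` implements `sample_until`, only one way the behavior described above could be mimicked.

```r
# Hypothetical helper, NOT the purrrfect implementation: draw one value at a
# time from `x` (with replacement) until `is_done(draws)` is TRUE or `halt`
# draws have been made.
sample_until_sketch <- function(x, is_done, prob = NULL, halt = 1000) {
  draws <- x[0]                                  # empty vector of the same type as x
  while (length(draws) < halt) {
    draws <- c(draws, x[sample.int(length(x), 1, prob = prob)])
    if (is_done(draws)) break                    # stopping condition met
  }
  draws
}

set.seed(1)
coin <- c('H', 'T')
# Stop as soon as the most recent flip is a head
flips <- sample_until_sketch(coin, \(d) tail(d, 1) == 'H')
flips
```

The result always ends in `'H'` and contains exactly one `'H'`, which is the shape of output we expect from a wait-for-the-first-success experiment.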

### Example 1 - Flip a fair coin until the first head

In this case, the stopping condition can be the literal value `'H'`

In [15]:
coin <- c('H', 'T')
sample_until(coin, 'H', replace = TRUE)

In [16]:
replicate(10, sample_until(coin, 'H', replace = TRUE))

.trial,.outcome
<dbl>,<list>
1,"T, T, T, H"
2,H
3,"T, H"
4,H
5,"T, T, T, T, H"
6,"T, H"
7,"T, T, T, T, H"
8,H
9,"T, H"
10,H


### Example 2 - Roll a fair die until we get a value of 4 or more.

In this case, where a success is a compound event, we need to write two predicate functions:
1. One that determines if a single roll is a success, and
2. A second that determines if the sample contains exactly one success.

In [24]:
is.success <- \(x) x >= 4
one.success <- \(x) num_successes(x, is.success) == 1
sample_until(1:6, one.success, replace = TRUE)

In [26]:
replicate(10, sample_until(1:6, one.success, replace = TRUE))

.trial,.outcome
<dbl>,<list>
1,5
2,5
3,6
4,6
5,"3, 3, 1, 2, 3, 5"
6,6
7,6
8,"3, 1, 3, 3, 2, 5"
9,5
10,6
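
To see what the second predicate is checking, it can help to evaluate it on fixed vectors. The sketch below uses a base R stand-in, `one.success.sketch` (a hypothetical name), and assumes that `num_successes(x, is.success)` counts the elements of `x` for which `is.success` is `TRUE`.

```r
is.success <- \(x) x >= 4
# Base R stand-in for "exactly one success so far"; the count is computed
# by summing a logical vector instead of calling purrrfect's num_successes.
one.success.sketch <- \(x) sum(is.success(x)) == 1

one.success.sketch(c(3, 3, 1))     # FALSE: no roll of 4 or more yet
one.success.sketch(c(3, 3, 1, 5))  # TRUE: the last roll is the first success
```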


### Example - LeBron shoots even more free throws

LeBron James is an NBA player with a storied career.  Over the course of his career, he has made 73.5% of all free throw attempts.  Suppose that LeBron plans to keep attempting free throws until he makes one shot.  We want to model the number of missed shots before he makes his first shot.

We will illustrate this process in three ways,

1. Simulating individual shots and reshaping to compute the number of failures,
2. Simulating individual shots and using `mutate` + `map` to compute the number of failures, and
3. Simulating the number of failures directly using `rgeom`.

#### Setting up a sample space

We need to sample from a space with a 73.5% chance of a made free throw, which we accomplish by passing a probability vector.

In [28]:
shot <- c('Make', 'Miss')
shot.probs <- c(0.735, 1 - 0.735)

In [29]:
replicate(10, sample_until(shot, 'Make', replace = TRUE, prob = shot.probs))

.trial,.outcome
<dbl>,<list>
1,Make
2,"Miss, Make"
3,Make
4,Make
5,Make
6,Make
7,Make
8,Make
9,"Miss, Make"
10,Make


#### Approach 1 - Simulate individual shots, reshape, and count trials

Comment out lines to explore the output in each step.

In [44]:
N <- 10
(replicate(N, sample_until(shot, 'Make', replace = TRUE, prob = shot.probs), .reshape='stack')
 %>% group_by(.trial)
 %>% summarise(total.trials = n())
 %>% mutate(num.failures = total.trials - 1)
 )

.trial,total.trials,num.failures
<dbl>,<int>,<dbl>
1,2,1
2,1,0
3,1,0
4,2,1
5,1,0
6,2,1
7,3,2
8,1,0
9,1,0
10,2,1


#### Approach 2 - Simulate individual shots and count trials by mapping `length`

Comment out lines to explore the output in each step.

In [45]:
N <- 10
(replicate(N, sample_until(shot, 'Make', replace = TRUE, prob = shot.probs))
 %>% mutate(total.shots = map_int(.outcome, length),
            num.failures = total.shots - 1)
 )

.trial,.outcome,total.shots,num.failures
<dbl>,<list>,<int>,<dbl>
1,"Miss, Make",2,1
2,Make,1,0
3,Make,1,0
4,Make,1,0
5,Make,1,0
6,Make,1,0
7,"Miss, Make",2,1
8,Make,1,0
9,Make,1,0
10,Make,1,0


#### Approach 3 - Simulate the number of failures directly using `rgeom`

Note that the signature of `rgeom` is `rgeom(n, prob)` (use help!).  Here we will use `n = 1` to mean one experiment per row and `prob = 0.735`.

**Important.** As noted in the help for `rgeom`, the returned value `x` is the number of failures before the first success.

In [46]:
?rgeom

In [41]:
N <- 10
(replicate_int(N, rgeom(1, 0.735))
 %>% mutate(num.failures = .outcome,
            total.shots = num.failures + 1)
)

.trial,.outcome,num.failures,total.shots
<dbl>,<int>,<int>,<dbl>
1,0,0,1
2,3,3,4
3,1,1,2
4,0,0,1
5,0,0,1
6,0,0,1
7,0,0,1
8,0,0,1
9,0,0,1
10,0,0,1
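
A quick sanity check that `rgeom` really counts failures (misses) rather than total shots: the smallest value it can return is 0, and the proportion of zeros should be close to $p$, since zero misses means the first shot was made. The sketch below uses only base R and calls `rgeom` once with `n = N` rather than one experiment per row; the variable name `misses` is chosen here for illustration.

```r
set.seed(2024)
p <- 0.735
N <- 100000
misses <- rgeom(N, p)   # N experiments at once: misses before the first make

min(misses)             # 0, so the values cannot be total shots
mean(misses == 0)       # proportion with no misses, should be near p
```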


## Estimating Probabilities

Once we have a column containing the number of failures, we can answer probability questions as usual, e.g., make a Boolean column, then use `estimate_prob` or `estimate_all_prob`.
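
As a rough illustration of the Boolean-column idea, the following base R sketch estimates the probability that LeBron misses at least two shots before his first make and compares it with the exact value from `pgeom`. It uses `mean` of a logical column as a stand-in for `estimate_prob`; the column names `num.failures` and `at.least.2` are chosen here for illustration.

```r
set.seed(42)
p <- 0.735
N <- 100000
sims <- data.frame(num.failures = rgeom(N, p))
sims$at.least.2 <- sims$num.failures >= 2        # Boolean column for the event

approx.prob <- mean(sims$at.least.2)             # proportion of TRUE values
actual.prob <- pgeom(1, p, lower.tail = FALSE)   # P(X > 1) = P(X >= 2)
c(approx = approx.prob, actual = actual.prob)
```

The exact value here is $(1-p)^2$, since at least two misses means the first two shots both miss.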

## Estimating the Distribution

Now that we know how to simulate a geometric random variable, we can use `tabulate` to estimate the values of the probability mass function.
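
The sketch below shows the same idea in base R, without `purrrfect`'s `tabulate`: compute the relative frequency of each value of $X$ and set it next to the exact PMF from `dgeom`. The names `vals`, `approx.pmf`, and `actual.pmf` are chosen for this illustration, and only the values 0 through 5 are shown.

```r
set.seed(7)
p <- 0.735
N <- 100000
x <- rgeom(N, p)                                  # simulated numbers of failures

vals <- 0:5
approx.pmf <- sapply(vals, \(v) mean(x == v))     # relative frequency of each value
actual.pmf <- dgeom(vals, p)                      # exact P(X = v)
data.frame(x = vals, approx.pmf, actual.pmf)
```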

## The Negative Binomial Distribution

The [negative binomial distribution](https://en.wikipedia.org/wiki/Negative_binomial_distribution) is a generalization of the geometric distribution, where we wait for $k$ successes before stopping.  Again, the random variable can be defined in either of two ways.

* Let $X$ be the number of failures observed before the $k$th success, and
* Let $Y$ be the total number of trials up to and including the $k$th success.

Both random variables can be referred to as having a negative binomial distribution, but with different formulae (PMF, mean, variance, etc.).

**We will use $X$ (number of failures) since that is what R uses.**
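
As a hedged sketch of direct simulation, `rnbinom(n, size, prob)` returns the number of failures before the `size`-th success. The example below assumes $k = 3$ made free throws (a value chosen only for illustration) and checks the result two ways: against a sum of $k$ independent geometric counts, and against the theoretical mean $k(1-p)/p$.

```r
set.seed(11)
k <- 3          # number of successes to wait for (illustrative choice)
p <- 0.735
N <- 100000

direct <- rnbinom(N, size = k, prob = p)                 # failures before the k-th success
# Same distribution built from k geometric waits per experiment:
# each column of the matrix holds the k geometric counts for one experiment.
via.geom <- colSums(matrix(rgeom(N * k, p), nrow = k))

c(direct = mean(direct), via.geom = mean(via.geom), actual = k * (1 - p) / p)
```

The two simulated means should agree with each other and with the theoretical value up to simulation error.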



## Simulating the Expected Value

So far, we have used our simulations to estimate the answers to probability questions and to estimate distributions. Simulations can also be used to estimate various expected values.  Here are three common types of expectation problems and the associated simulation strategy.

1. **Mean/Expected Value.** Simulate $X$ $\longrightarrow$ summarise using `mean`.
2. **Variance/Standard Deviation.** Simulate $X$ $\longrightarrow$ summarise using `var` or `sd` functions.
3. **Expected Value of a function.** Simulate $X$ $\longrightarrow$ use `mutate` to compute an `f(X)` column $\longrightarrow$ summarise using `mean`.
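
The three strategies above can be sketched in base R for the geometric number of misses $X$ with $p = 0.735$. Here $f(X) = X^2$ is an arbitrary function chosen for illustration, and the exact value $E(X^2) = \mathrm{Var}(X) + E(X)^2 = (1-p)(2-p)/p^2$ is included for comparison.

```r
set.seed(99)
p <- 0.735
N <- 100000
sims <- data.frame(x = rgeom(N, p))   # simulate X
sims$fx <- sims$x^2                   # compute an f(X) column

approx.mu  <- mean(sims$x)            # 1. mean
approx.var <- var(sims$x)             # 2. variance (use sd() for the standard deviation)
approx.ex2 <- mean(sims$fx)           # 3. expected value of f(X) = X^2

rbind(approx = c(mu = approx.mu, var = approx.var, ex2 = approx.ex2),
      actual = c(mu = (1 - p) / p, var = (1 - p) / p^2, ex2 = (1 - p) * (2 - p) / p^2))
```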

#### Example 1 - Mean, variance, and standard deviation of LeBron's number of made shots

Again, we have LeBron--a 73.5% free throw shooter--shooting 10 free throws.  Let's estimate the mean, variance, and standard deviation of the number of made shots.  I will use `rbinom` to directly simulate the binomial trials.


In [None]:
n <- 10
p <- 0.735
N <- 10000
( replicate_int(N, rbinom(1, n, p))
  %>% summarise(approx.mu = mean(.outcome),
                approx.var = var(.outcome),
                approx.sd = sd(.outcome))

)

approx.mu,approx.var,approx.sd
<dbl>,<dbl>,<dbl>
7.3398,1.991135,1.411076


#### Comparing to the theoretical values

Next, we can use `mutate` to compute the actual theoretical values and use `relocate` to put the respective values next to each other.

In [None]:
n <- 10
p <- 0.735
N <- 100000
( replicate_int(N, rbinom(1, n, p))
  %>% summarise(approx.mu = mean(.outcome),
                approx.var = var(.outcome),
                approx.sd = sd(.outcome))
  %>% mutate(actual.mu = n*p,
             actual.var = n*p*(1-p),
             actual.sd = sqrt(actual.var))
  %>% relocate(actual.mu, .after = approx.mu)
  %>% relocate(actual.var, .after = approx.var)
  %>% relocate(actual.sd, .after = approx.sd)
)



approx.mu,actual.mu,approx.var,actual.var,approx.sd,actual.sd
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
7.35553,7.35,1.946488,1.94775,1.395166,1.395618


### <font color="red"> Exercise 2.1.1 - Binomial probabilities </font>

Historically, circuit boards used in the manufacture of video game consoles are defective 5% of the time. Let $X$ represent the number of defective boards in a random sample of size $n=25$.  

1. Explain why we should use the binomial distribution here.
2. Find $P(X \ge 5)$ using each of the three approaches discussed above.
3. Use `tabulate` to estimate the PMF of $X$.
4. Use a simulation to estimate $E(X)$ and $SD(X)$.

